Abstract. By means of two simple convexity arguments we are able to develop a general
method for proving consistency and asymptotic normality of estimators that are defined by
minimisation of convex criterion functions. This method is then applied to a fair range of
different statistical estimation problems, including Cox regression, logistic and Poisson
regression, least absolute deviation regression outside model conditions, and
pseudo-likelihood estimation for Markov chains.

Our paper has two aims. The first is to exposit the method itself, which in many cases,
under reasonable regularity conditions, leads to new proofs that are simpler than the
traditional proofs. Our second aim is to exploit the method to its limits for logistic
regression and Cox regression, where we seek asymptotic results under as weak regularity
conditions as possible. For Cox regression in particular we are able to weaken previously
published regularity conditions substantially.

KEY WORDS: argmin lemma approximation, convexity, Cox regression, LAD regression,
log-concavity, logistic regression, minimal conditions, partial likelihood, pseudo-likelihood

1. Introduction. This paper develops a simple method for proving consistency and
asymptotic normality for estimators defined by minimisation of a convex criterion function.
Versions of the method have been used or partially used by several authors, for various
specific occasions, including Jurečková (1977, 1991), Andersen and Gill (1982), Hjort (1986,
1988a), Haberman (1989), Pollard (1990, 1991), Bickel, Klaassen, Ritov and Wellner (1992),
Niemiro (1992), but the general principle has not been widely recognised.

Our aims in this paper are twofold. (i) The primary objective is to explain the basic method,
and to illustrate its use in a fair range of statistical estimation problems. In section 2 we
state and prove some general theorems about estimators that are defined via some form of
convex minimisation, and in sections 3 and 4 illustrate their use by means of applications to
sample quantiles, maximum likelihood estimation when the likelihood is log-concave, and least
squares and least absolute deviation linear regression outside model conditions. Similarly
sections 5 and 6 treat logistic and Cox regression, while still further applications are
reported in section 7, including Poisson regression. The proofs are relatively simple and
instructive, at least when regularity conditions are kept reasonable. (ii) The second
objective is to improve on previously published results, in the sense of pruning down the
regularity conditions of theorems for two important models, namely logistic regression in
section 5 and Cox regression in sections 6 and 7A. The two aims are mildly conflicting,
editorially speaking. We soften the conflict in sections 5 and 6 by writing down first a
simple version of a theorem with a simple proof, and then a harder version with a harder
proof. In this way we hope that our article has some pedagogic merits while at the same time
also offering something to the specialists.

Instead of treating minimisation as a search for a root of a derivative, we work directly with 
the argmin (a minimising value) of a random function and are able to approximate it with the 
argmin of a simpler random function. In this way we manage to avoid special arguments that are 
often used to prove consistency separately. Convexity essentially buys us both consistency and 
asymptotic normality with the same dollar, and sometimes with cheaper regularity conditions. 
The two convexity lemmas that will be used are as follows. 
Lemma 1: From pointwise to uniform. Suppose $A_n(s)$ is a sequence of convex random
functions defined on an open convex set $S$ in $\mathbb{R}^p$, which converges in probability
to some $A(s)$, for each $s$. Then $\sup_{s\in K} |A_n(s) - A(s)|$ goes to zero in probability,
for each compact subset $K$ of $S$.

Proof: This is proved in Andersen and Gill (1982, appendix), crediting T. Brown, via 'diag- 
onal subsequencing' and an appeal to a corresponding non-stochastic result (see Rockafellar, 1970, 
Theorem 10.8). For a direct proof, see Pollard (1991, section 6). [] 

A convex function is continuous and attains its minimum on compact sets, but it can be flat at
its bottom and have several minima. For simplicity we speak about 'the argmin' when referring to 
any of the possible minimisers. The argmin can be selected in a measurable way, as explained in 
Niemiro (1992, p. 1531), for example. 

Lemma 2: Nearness of argmins. Suppose $A_n(s)$ is convex as in Lemma 1 and is approximated
by $B_n(s)$. Let $\alpha_n$ be the argmin of $A_n$, and assume that $B_n$ has a unique argmin
$\beta_n$. Then there is a probabilistic bound on how far $\alpha_n$ can be from $\beta_n$:
for each $\delta > 0$,

    $\Pr\{|\alpha_n - \beta_n| \ge \delta\} \le \Pr\{\Delta_n(\delta) \ge \tfrac12 h_n(\delta)\}$,    (1.1)

where

    $\Delta_n(\delta) = \sup_{|s-\beta_n|\le\delta} |A_n(s) - B_n(s)|$  and  $h_n(\delta) = \inf_{|s-\beta_n|=\delta} B_n(s) - B_n(\beta_n)$.    (1.2)

Proof: The lemma as stated has nothing to do with convergence or indeed with the 'n'
subscript at all, of course, but is stated in a form useful for later purposes. To prove it,
let $s$ be an arbitrary point outside the ball around $\beta_n$ with radius $\delta$, say
$s = \beta_n + lu$ for a unit vector $u$, where $l > \delta$. Convexity of $A_n$ implies

    $(1 - \delta/l)\,A_n(\beta_n) + (\delta/l)\,A_n(s) \ge A_n(\beta_n + \delta u)$.

Writing for convenience $A_n(s) = B_n(s) + r_n(s)$, we deduce

    $(\delta/l)\,\{A_n(s) - A_n(\beta_n)\} \ge A_n(\beta_n + \delta u) - A_n(\beta_n)$
        $= B_n(\beta_n + \delta u) + r_n(\beta_n + \delta u) - B_n(\beta_n) - r_n(\beta_n)$
        $\ge h_n(\delta) - 2\Delta_n(\delta)$.

If $\Delta_n(\delta) < \tfrac12 h_n(\delta)$, then $A_n(s) > A_n(\beta_n)$ for all $s$ outside
the $\delta$-ball, which means that the minimiser $\alpha_n$ must be inside. This proves (1.1).
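
As an illustration of Lemma 2 (not part of the original argument), the following minimal sketch checks the deterministic content of (1.1) in one dimension. The functions `A` and `B`, the grid, and the radius `delta = 0.6` are hypothetical choices made only for this example: `B` is a quadratic with argmin 1, and `A` adds a small convex perturbation.

```python
import numpy as np

def lemma2_quantities(A, B, beta, delta, grid):
    """Return (Delta(delta), h(delta)) of (1.2) for one-dimensional A and B."""
    inside = grid[np.abs(grid - beta) <= delta]
    Delta = np.max(np.abs(A(inside) - B(inside)))
    # In one dimension the sphere |s - beta| = delta consists of two points.
    sphere = np.array([beta - delta, beta + delta])
    h = np.min(B(sphere)) - B(beta)
    return Delta, h

# Hypothetical example: B is quadratic with argmin beta = 1,
# A is B plus a small convex perturbation, hence also convex.
B = lambda s: 0.5 * (s - 1.0) ** 2
A = lambda s: B(s) + 0.05 * np.abs(s)
beta = 1.0

grid = np.linspace(-3.0, 5.0, 80001)
alpha = grid[np.argmin(A(grid))]      # grid approximation to the argmin of A

delta = 0.6
Delta, h = lemma2_quantities(A, B, beta, delta, grid)
print(Delta, 0.5 * h, abs(alpha - beta))
```

Here the sup-distance between `A` and `B` on the ball is below half of `h`, so the lemma's argument places the argmin of `A` within `delta` of `beta`; the grid search agrees.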

It is worth pointing out that any norm on $\mathbb{R}^p$ can be used here, and that no
assumptions need to be placed on the $B_n$ function beside the existence of the minimiser
$\beta_n$.

The two lemmas will deliver more than mere consistency when applied to suitably rescaled 
and recentred versions of convex processes. 

We record a couple of useful implications of Lemma 2. If $A_n - B_n$ goes to zero uniformly
on bounded sets in probability and $\beta_n$ is stochastically bounded, then
$\Delta_n(\delta) \to_p 0$ by a simple argument. It follows that $\alpha_n - \beta_n \to_p 0$
provided only that $1/h_n(\delta)$ is stochastically bounded for each fixed $\delta$. This last
requirement says that $B_n$ shouldn't flatten out around its minimum as $n$ increases.

Basic Corollary. Suppose $A_n(s)$ is convex and can be represented as
$\tfrac12 s'Vs + U_n's + C_n + r_n(s)$, where $V$ is symmetric and positive definite, $U_n$ is
stochastically bounded, $C_n$ is arbitrary, and $r_n(s)$ goes to zero in probability for each
$s$. Then $\alpha_n$, the argmin of $A_n$, is only $o_p(1)$ away from $\beta_n = -V^{-1}U_n$,
the argmin of $\tfrac12 s'Vs + U_n's + C_n$. If also $U_n \to_d U$ then
$\alpha_n \to_d -V^{-1}U$.

Proof: The function $A_n(s) - U_n's - C_n$ is convex and goes to $\tfrac12 s'Vs$ in
probability for each $s$. By the first lemma the convergence is uniform on bounded sets. Let
$\Delta_n(\delta)$ be the supremum of $|r_n(s)|$ over $\{|s - \beta_n| \le \delta\}$. Then, by
Lemma 2,

    $\alpha_n = -V^{-1}U_n + \varepsilon_n$, where $\Pr\{|\varepsilon_n| \ge \delta\} \le \Pr\{\Delta_n(\delta) \ge \tfrac12 k\delta^2\} \to 0$.    (1.3)

Here $k$ is the smallest eigenvalue of $V$, and $\Delta_n(\delta) \to_p 0$, by the arguments
used above. []

A useful slight extension of this is when $A_n(s) = \tfrac12 s'V_ns + U_n's + C_n + r_n(s)$ is
convex, with a nonnegative definite symmetric $V_n$ matrix that converges in probability to a
positive definite $V$. Writing $V_n = V + \eta_n$ the remainder $\eta_n$ can be absorbed into
$r_n(s)$ and the result above holds.
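
The approximating quadratic in the Basic Corollary has an explicit argmin, which a few lines of NumPy can confirm. This is only a sketch with a hypothetical matrix `V` and vector `U` chosen for illustration; it checks that the point $-V^{-1}U$ has zero gradient for $\tfrac12 s'Vs + U's$.

```python
import numpy as np

def quadratic_argmin(V, U):
    """Argmin of 0.5 s'Vs + U's + C for a symmetric positive definite V."""
    return -np.linalg.solve(V, U)

# Hypothetical inputs, chosen only for illustration.
V = np.array([[2.0, 0.5],
              [0.5, 1.0]])
U = np.array([1.0, -2.0])

beta = quadratic_argmin(V, U)
grad = V @ beta + U          # gradient of the quadratic at beta, should vanish
print(beta, grad)
```

Solving the linear system rather than inverting `V` is a numerical convenience, not something the corollary itself requires.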

2. General results for convex minimisation estimators. This section presents three 
basic theorems about the asymptotic behaviour of estimators that are defined by minimisation of 
some convex criterion function. The first is for the independent identically distributed (i.i.d.) case. 
The second is stated for independent observations with different distributions, and is suitable for 
proving consistency and asymptotic normality in regression models, for example, under model 
conditions. The third theorem also applies to regression model estimators, but is suited to give 
asymptotic results also outside model conditions. Applications and illustrations are provided in 
sections 3, 4 and 5. 

2A. A theorem for the i.i.d. case. Let $Y_1, Y_2, \ldots$ be i.i.d. from some distribution
$F$. A certain $p$-dimensional parameter $\theta_0 = \theta(F)$ is of interest. Assume that one
of the ways of characterising this parameter is to say that it minimises
$\mathrm{E}\,g(Y,t) = \int g(y,t)\,dF(y)$, where the $g(y,t)$ function is convex in $t$.
Examples include quantiles, the mean, M-estimation and maximum likelihood estimation
parameters and so on; see sections 3 and 4. In the expectation expression above, and later
on, $Y$ denotes a generic observation from the true underlying $F$.

Some weak expansion of $g(y,t)$ around the value $\theta_0$ of $t$ is needed, but we avoid
explicitly requiring pointwise derivatives to exist. With this in mind, write

    $g(y, \theta_0 + t) - g(y, \theta_0) = D(y)'t + R(y,t)$    (2.1)

for a $D(y)$ with mean zero under $F$. If $\mathrm{E}\,R(Y,t)^2$ is of order $o(|t|^2)$ as
$t \to 0$, as we will usually require, then $D(y)$ is nothing but the derivative in quadratic
mean of the function $g(y, \theta_0 + t)$ at $t = 0$.

Theorem 2.1. Suppose that $g(y,t)$ is convex in $t$ as above, and that (2.1) holds with

    $\mathrm{E}\{g(Y, \theta_0 + t) - g(Y, \theta_0)\} = \mathrm{E}\,R(Y,t) = \tfrac12 t'Jt + o(|t|^2)$ as $t \to 0$    (2.2)

for a positive definite matrix $J$. Suppose also that $\mathrm{Var}\,R(Y,t) = o(|t|^2)$, and
that $D(Y)$ has a finite covariance matrix $K = \int D(y)D(y)'\,dF(y)$. Then the estimator
$\hat\theta_n$ which minimises $\sum_{i\le n} g(Y_i,t)$ is $\sqrt{n}$-consistent for
$\theta_0$, and

    $\sqrt{n}(\hat\theta_n - \theta_0) = -J^{-1} n^{-1/2} \sum_{i\le n} D(Y_i) + o_p(1)$.    (2.3)

In particular $\sqrt{n}(\hat\theta_n - \theta_0) \to_d -J^{-1}\mathcal{N}_p\{0, K\} = \mathcal{N}_p\{0, J^{-1}KJ^{-1}\}$.

Hjort and Pollard 3 May 1993 



Proof: Consider the convex function
$A_n(s) = \sum_{i\le n} \{g(Y_i, \theta_0 + s/\sqrt{n}) - g(Y_i, \theta_0)\}$. It is minimised
by $\sqrt{n}(\hat\theta_n - \theta_0)$. Note first that
$n\,\mathrm{E}\,R(Y, s/\sqrt{n}) = \tfrac12 s'Js + r_{n,0}(s)$ where
$r_{n,0}(s) = n\,o(|s|^2/n) \to 0$ for fixed $s$. Accordingly, using (2.1),

    $A_n(s) = \sum_{i\le n} \{D(Y_i)'s/\sqrt{n} + R(Y_i, s/\sqrt{n}) - \mathrm{E}\,R(Y_i, s/\sqrt{n})\} + n\,\mathrm{E}\,R(Y, s/\sqrt{n})$
           $= U_n's + \tfrac12 s'Js + r_{n,0}(s) + r_n(s)$,

in which

    $U_n = n^{-1/2} \sum_{i\le n} D(Y_i)$  and  $r_n(s) = \sum_{i\le n} \{R(Y_i, s/\sqrt{n}) - \mathrm{E}\,R(Y_i, s/\sqrt{n})\}$.

Now $r_n(s)$ tends to zero in probability for each $s$, since its mean is zero and its variance
is $\sum_{i\le n} \mathrm{Var}\,R(Y_i, s/\sqrt{n}) = n\,o(1/n)$. This, together with the Basic
Corollary of section 1, proves (2.3) and the limit distribution result, since $U_n$ goes to a
$\mathcal{N}_p\{0, K\}$ by the central limit theorem. Note that both consistency and asymptotic
normality followed from the same approximation argument. []

Note that $\mathrm{Var}\,R(Y,t) = \mathrm{E}\,R(Y,t)^2 + O(|t|^4)$, so we might as well work with second moments
rather than variances. Notice also that the differentiability assumption (2.2) is applied to the 
process obtained by averaging out over the distribution F, a smoothing that can eliminate trouble- 
some pointwise behaviour of R{y,t). Huber (1967) recognised this advantage of smoothing before 
differentiating. 

2B. A theorem for independent observations with different distributions. Assume that the true
density of $Y_i$ is of the form $f_i(y_i) = f_i(y_i, \theta_0, \eta_i)$, where $\theta_0$ is a
certain $p$-dimensional parameter of interest. Suppose that an estimator $\hat\theta_n$ for
$\theta_0$ is proposed which minimises $\sum_{i\le n} g_i(Y_i, \theta)$, where the
$g_i(y_i, \theta)$ functions are convex in $\theta$. A simple example is linear regression,
where $Y_i = \theta_0'x_i + \varepsilon_i$ and $\hat\theta_n$ minimises
$\sum_{i\le n} (Y_i - \theta'x_i)^2$.

Suppose that $g_i(y_i, \theta_0 + t) - g_i(y_i, \theta_0) = D_i(y_i)'t + R_i(y_i, t)$, where
$\mathrm{E}\,D_i(Y_i) = 0$. With the previous development in mind, write

    $\mathrm{E}\,R_i(Y_i, t) = \tfrac12 t'A_it + v_{i,0}(t)$  and  $\mathrm{Var}\,R_i(Y_i, t) = v_i(t)$,    (2.4)

and let $B_i$ be the variance matrix for $D_i(Y_i)$. The sums $J_n = \sum_{i\le n} A_i$ and
$K_n = \sum_{i\le n} B_i$ are featured below. The first useful result, properly generalising
Theorem 2.1, is the following, which is proved by copying the arguments of 2A mutatis mutandis.

Theorem 2.2. Assume that $\sum_{i\le n} v_{i,0}(s/\sqrt{n}) \to 0$ and
$\sum_{i\le n} v_i(s/\sqrt{n}) \to 0$ for each $s$, and that $J_n/n$ and $K_n/n$ converge to $J$
and $K$, where $J$ is positive definite. Then $\sqrt{n}(\hat\theta_n - \theta_0)$ is only
$o_p(1)$ away from $-J^{-1} n^{-1/2} \sum_{i\le n} D_i(Y_i)$. If in particular the Lindeberg
requirements are fulfilled for the $D_i(Y_i)$ sequence, then
$\sqrt{n}(\hat\theta_n - \theta_0) \to_d \mathcal{N}_p\{0, J^{-1}KJ^{-1}\}$.

Another result which sometimes is stronger is as follows. Assume that
$\sum_{i\le n} v_{i,0}(J_n^{-1/2}s) \to 0$ and $\sum_{i\le n} v_i(J_n^{-1/2}s) \to 0$ for each
$s$, and that $J_n^{-1}K_n$ is bounded. Then

    $J_n^{1/2}(\hat\theta_n - \theta_0) = -J_n^{-1/2}K_n^{1/2}U_n + o_p(1)$,    (2.5)

where $U_n = K_n^{-1/2} \sum_{i\le n} D_i(Y_i)$. If in particular there are matrices $J$ and $K$
such that $J_n^{-1}K_n$ goes to $J^{-1}K$, and the Lindeberg conditions are fulfilled, securing
$U_n \to_d \mathcal{N}_p\{0, I_p\}$, then
$J_n^{1/2}(\hat\theta_n - \theta_0) \to_d \mathcal{N}_p\{0, J^{-1/2}KJ^{-1/2}\}$. This result is
proved by studying the convex function
$\sum_{i\le n} \{g_i(Y_i, \theta_0 + J_n^{-1/2}s) - g_i(Y_i, \theta_0)\}$. In some situations of
interest $J_n = K_n$, further simplifying (2.5). See section 5 for an illustration of this.

2C. A theorem for regression type estimators outside model conditions. The results of 2B 
are sometimes not sufficient. Theorem 2.3 below will work for asymptotic behaviour of regression 
methods outside model conditions, as made clear in section 3D, for example. 

Assume that some covariate vector $x_i = (x_{i,1}, \ldots, x_{i,p})'$ is associated with
observation $Y_i$. For simplicity we formulate a result in terms of densities, rather than
general distribution functions. Suppose that the true density for $Y_i$ given $x_i$ is
$f(y_i|x_i)$ but that some regression model postulates $f(y_i, \beta|x_i)$, for a suitable
$p$-dimensional parameter vector $\beta$. We consider an estimator $\hat\beta_n$ defined to
minimise $\sum_{i\le n} g_i(Y_i, \beta|x_i)$, where $g_i(y_i, \beta|x_i)$ is convex in $\beta$
for each $(y_i, x_i)$. In the following we shall assume that the empirical distribution of
$x_1, \ldots, x_n$, whether actually random or under the experimenter's control, converges to
a well-defined distribution $H$ in $x$-space. This conceptual limit is to be thought of as the
'covariate distribution'. Assume that $n^{-1} \sum_{i\le n} g_i(Y_i, \beta|x_i)$ converges in
probability to a function with a unique minimiser $\beta_0$.

Under these circumstances it is not generally possible to get a representation like the one
that led to (2.4), because of heterogeneity as well as potential modelling bias, as the
applications in section 3D and section 5C will illustrate. It becomes necessary to include an
$x_i$-dependent bias term. Suppose that it is possible to write

    $g_i(y_i, \beta_0 + t|x_i) - g_i(y_i, \beta_0|x_i) = \{\delta(x_i) + D_i(y_i|x_i)\}'t + R_i(y_i, t|x_i)$,    (2.6)

where $\mathrm{E}\,D_i(Y_i|x_i) = 0$ and $\mathrm{Var}\,D_i(Y_i|x_i) = B_i(x_i)$. Write furthermore

    $\mathrm{E}\,R_i(Y_i, t|x_i) = \tfrac12 t'A_i(x_i)t + v_{i,0}(t|x_i)$  and  $\mathrm{Var}\,R_i(Y_i, t|x_i) = v_i(t|x_i)$.    (2.7)

This time three matrix sums are needed, $J_n = \sum_{i\le n} A_i(x_i)$,
$K_n = \sum_{i\le n} B_i(x_i)$, and $L_n = \sum_{i\le n} \delta(x_i)\delta(x_i)'$.

Theorem 2.3. Assume that the $x_1, x_2, \ldots$ sequence is such that

    $\sum_{i\le n} v_{i,0}(s/\sqrt{n}|x_i) \to 0$  and  $\sum_{i\le n} v_i(s/\sqrt{n}|x_i) \to 0$  for each $s$,    (2.8)

that the $J_n/n$ sequence is bounded away from zero, and that the $K_n/n$ and $L_n/n$ sequences
are bounded. Then

    $\sqrt{n}(\hat\beta_n - \beta_0) = -(J_n/n)^{-1}\{n^{-1/2} \sum_{i\le n} \delta(x_i) + n^{-1/2} \sum_{i\le n} D_i(Y_i|x_i)\} + \varepsilon_n$,    (2.9)

where $\varepsilon_n = \varepsilon_n(x_1, \ldots, x_n) \to_p 0$.

The proof is quite similar to previous proofs in this section, taking as its starting point the
convex function $\sum_{i\le n} \{g_i(Y_i, \beta_0 + s/\sqrt{n}|x_i) - g_i(Y_i, \beta_0|x_i)\}$.
We omit the details.

The (2.9) representation has two statistically interesting implications. (i) In the
conditional framework with a given $x_i$ sequence, suppose that $J_n/n \to J$ and
$K_n/n \to K$ and that the Lindeberg condition holds for
$\sum_{i\le n} n^{-1/2}D_i(Y_i|x_i)$. Then

    $\sqrt{n}(\hat\beta_n - \beta_0)\,|\,x_1, \ldots, x_n = \mathcal{N}_p\{-(J_n/n)^{-1} n^{-1/2} \sum_{i\le n} \delta(x_i),\ J^{-1}KJ^{-1}\} + \varepsilon_n'$,    (2.10)

where $\varepsilon_n' \to_p 0$. So $\hat\beta_n$ is approximately normal with variance matrix
$J^{-1}KJ^{-1}/n$, but actually biased with a bias depending on $x_1, \ldots, x_n$. The bias is
typically zero under exact regression model conditions, see 3D below. (ii) Secondly, if the
$x_i$'s can be treated as being independent and coming from their own 'design distribution'
$H(dx)$ in $x$-space, then $\delta(x_i)$ has mean zero and variance matrix $L$, say. In this
unconditional framework

    $\sqrt{n}(\hat\beta_n - \beta_0) = J^{-1}\mathcal{N}_p\{0, K + L\} + o_p(1) \to_d \mathcal{N}_p\{0, J^{-1}(K + L)J^{-1}\}$.    (2.11)



3. Applications and illustrations. 

3A. The median. Let $Y_1, Y_2, \ldots$ be i.i.d. from a density $f$, let $\mu$ be the
population median, and let $M_n$ be the sample median from the first $n$ observations. We
shall prove the well known fact that $M_n$ is consistent for $\mu$ and that

    $\sqrt{n}(M_n - \mu) \to_d \mathcal{N}\{0, 1/(4f(\mu)^2)\}$,    (3.1)

provided only that $f$ is positive and continuous at $\mu$.

This fits into the framework of 2A with the convex function $g(y,t) = |y - t|$. The (2.1)
expansion reads

    $|y - (\mu + t)| - |y - \mu| = D(y)t + R(y,t)$,

where $D(y) = -I\{y > \mu\} + I\{y \le \mu\}$, and

    $R(y,t) = 2\{t - (y - \mu)\}\,I\{\mu < y \le \mu + t\}$  if $t > 0$,
    $R(y,t) = 2\{(y - \mu) - t\}\,I\{\mu + t \le y \le \mu\}$  if $t < 0$,

while $R(y,0) = 0$, which makes it easy to verify

    $\mathrm{E}\,R(Y,t) = f(\mu)t^2 + o(t^2)$  and  $\mathrm{E}\,R(Y,t)^2 = \tfrac43 f(\mu)|t|^3 + o(|t|^3)$.

Actually we only need a distribution function with a positive derivative at $\mu$. Of course
we don't get the explicit $|t|^3$ bound then. Notice that $D(Y)$ and $R(Y,t)$ are bounded
functions even if $|Y_i|$ itself can have infinite expected value, since we work with the
difference $|Y - (\mu + t)| - |Y - \mu|$. Since the variance of $D(Y)$ is equal to 1,
assertion (3.1) follows from Theorem 2.1. See 4A below for an extension of this result.
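
A small Monte Carlo experiment can make (3.1) concrete. The sketch below is an illustration only: the standard normal model, the sample size 401, the 4000 replications and the seed are all hypothetical choices, and the simulated variance of $\sqrt{n}(M_n - \mu)$ is merely expected to be close to the limit $1/\{4f(\mu)^2\} = \pi/2$, not equal to it.

```python
import numpy as np

def scaled_median_variance(n, reps, rng):
    """Monte Carlo variance of sqrt(n) * (M_n - mu) for standard normal data (mu = 0)."""
    samples = rng.standard_normal((reps, n))
    medians = np.median(samples, axis=1)
    return np.var(np.sqrt(n) * medians)

rng = np.random.default_rng(0)   # fixed seed, an arbitrary choice
mc_variance = scaled_median_variance(n=401, reps=4000, rng=rng)

# Limit variance from (3.1) with f(0) = 1/sqrt(2*pi) for the standard normal.
f_at_median = 1.0 / np.sqrt(2.0 * np.pi)
limit_variance = 1.0 / (4.0 * f_at_median ** 2)

print(mc_variance, limit_variance)
```

An odd sample size is used so that the sample median is a single order statistic.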

3B. Simultaneous asymptotic normality of order statistics. Let $f$ be positive and continuous
in its support region, and consider the function

    $g_p(y,t) = p\{(y - t)_+ - y_+\} + (1 - p)\{(t - y)_+ - (-y)_+\}$.

It is convex in $t$ and its expected value is minimal for $t = F^{-1}(p) = \mu_p$, the $p$-th
quantile of the underlying distribution, and

    $\mathrm{E}\{g_p(Y,t) - g_p(Y,\mu_p)\} = \tfrac12 f(\mu_p)(t - \mu_p)^2 + o((t - \mu_p)^2)$

can be shown. The (2.1) expansion works with

    $D(y) = (1 - p)I\{y \le \mu_p\} - pI\{y > \mu_p\} = I\{y \le \mu_p\} - p$

and $\mathrm{E}\,R(Y,t)^2 = O(|t|^3)$ can be checked. Let $Q_{n,p}$ be the minimiser of
$\sum_{i\le n} g_p(Y_i,t)$, which is sometimes non-unique, but which in any case is at most
$O_p(n^{-1})$ away from the $[np]$'th order statistic $Y_{([np])}$. The general theorem of 2A
implies

    $Z_n(p) = \sqrt{n}(Q_{n,p} - \mu_p) = -f(\mu_p)^{-1}\sqrt{n}\{F_n(\mu_p) - p\} + \varepsilon_n(p)$,    (3.2)

where $F_n$ is the empirical distribution function and $\varepsilon_n(p) \to 0$ in probability
for each $p$. This links the quantile process $Z_n$ to the empirical process, and proves
finite-dimensional convergence in distribution of the quantile process to a Gaussian process
$Z(.)$ with mean zero and covariance structure

    $\mathrm{cov}\{Z(p_1), Z(p_2)\} = p_1(1 - p_2)/\{f(\mu_{p_1})f(\mu_{p_2})\}$  for $p_1 \le p_2$.    (3.3)

The traditional proofs of this finite-dimensional convergence result are rather messier than
the above. There is in reality also process convergence here, of course, which is linked to
the fact that $\sup_{\delta\le p\le 1-\delta} |\varepsilon_n(p)|$ goes to zero in probability
for each $\delta$. Proving this is not within easy reach of our method, however. See also the
comment ending 3D below.
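
The link between the criterion function of 3B and order statistics can be checked directly on a small sample. The following sketch uses a hypothetical data vector and $p = 0.25$; because the summed criterion is piecewise linear and convex in $t$, its minimum is attained at a data point, so it suffices to evaluate it at the sorted observations.

```python
import numpy as np

def check_loss(y, t, p):
    """Sum over the sample of g_p(y_i, t) as defined in 3B."""
    pos = lambda z: np.maximum(z, 0.0)
    return np.sum(p * (pos(y - t) - pos(y)) + (1 - p) * (pos(t - y) - pos(-y)))

# Hypothetical sample of size 9, chosen only for illustration.
y = np.array([3.1, -0.4, 1.7, 0.2, 2.5, -1.3, 0.9, 4.0, 1.1])
p = 0.25

candidates = np.sort(y)
losses = np.array([check_loss(y, t, p) for t in candidates])
q_np = candidates[np.argmin(losses)]   # minimiser of the criterion over t
print(q_np)
```

With $np = 2.25$ the slope of the criterion changes sign at the third smallest observation, which is the value the search returns.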

3C. Estimation in $L_\alpha$ mode. Let more generally $M_{n,\alpha}$ minimise
$\sum_{i\le n} |Y_i - t|^\alpha$, where $\alpha > 1$, and let $\xi_\alpha$ be the population
parameter that minimises $\mathrm{E}|Y - t|^\alpha$. For $\alpha = \tfrac32$ we would expect an
estimator with properties somehow between those for the median and the mean, for example. We
can prove

    $\sqrt{n}(M_{n,\alpha} - \xi_\alpha) \to_d \mathcal{N}\{0, \tau^2\}$  where  $\tau^2 = \mathrm{E}|Y - \xi_\alpha|^{2(\alpha-1)} / \{(\alpha - 1)\,\mathrm{E}|Y - \xi_\alpha|^{\alpha-2}\}^2$,    (3.4)

assuming $\mathrm{E}|Y|^{2(\alpha-1)}$ to be finite. The proof proceeds by mimicking that for
the simpler case $\alpha = 1$. One needs to use

    $D(y) = -\alpha(y - \xi_\alpha)^{\alpha-1} I\{y > \xi_\alpha\} + \alpha(\xi_\alpha - y)^{\alpha-1} I\{y \le \xi_\alpha\}$,

and it is somewhat more cumbersome but feasible to bound $\mathrm{E}\,R(Y,t)^2$. And finally
needed is the analytical fact that
$\mathrm{E}\{|Y - (\xi_\alpha + t)|^\alpha - |Y - \xi_\alpha|^\alpha\} = \tfrac12 K_\alpha t^2 + o(t^2)$,
in which $K_\alpha = \alpha(\alpha - 1)\,\mathrm{E}|Y - \xi_\alpha|^{\alpha-2}$.

It is interesting to note here that $(\alpha - 1)\,\mathrm{E}|Y - \xi_\alpha|^{\alpha-2}$ tends
to $2f(F^{-1}(\tfrac12))$ as $\alpha$ tends to 1, explaining the connection from the
moment-type expression for the variance $\tau^2$ of (3.4) to the rather different-looking
expression for the median case.
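
To get a feel for the variance expression in (3.4), one can evaluate a sample analogue of it. The sketch below is an assumption-laden illustration rather than a method from the text: `l_alpha_location` (a ternary search) and `tau2_plugin` (a plug-in estimate replacing expectations by sample means) are hypothetical helpers, and the data vector is made up. For $\alpha = 2$ the location is the mean and the plug-in value reduces to the sample variance, which gives a simple sanity check.

```python
import numpy as np

def l_alpha_location(y, alpha, tol=1e-10):
    """Minimise sum |y_i - t|^alpha over t by ternary search (convex for alpha >= 1)."""
    lo, hi = float(np.min(y)), float(np.max(y))
    crit = lambda t: np.sum(np.abs(y - t) ** alpha)
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if crit(m1) <= crit(m2):
            hi = m2          # a minimiser lies in [lo, m2]
        else:
            lo = m1          # a minimiser lies in [m1, hi]
    return 0.5 * (lo + hi)

def tau2_plugin(y, alpha):
    """Plug-in analogue of tau^2 in (3.4); for alpha < 2 it needs y_i != location."""
    xi = l_alpha_location(y, alpha)
    dev = np.abs(y - xi)
    numerator = np.mean(dev ** (2 * (alpha - 1)))
    denominator = ((alpha - 1) * np.mean(dev ** (alpha - 2))) ** 2
    return xi, numerator / denominator

# Hypothetical data, chosen only for illustration.
y = np.array([0.3, -1.2, 2.5, 0.8, -0.4, 1.9, 0.1])
xi_2, tau2_2 = tau2_plugin(y, 2.0)
print(xi_2, tau2_2, np.mean(y), np.var(y))
```

The same functions can be called with, say, `alpha = 1.5`, although no closed form is then available for comparison.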

It is also worth pointing out that the (3.4) result can be reached via influence functions and
function space methods as well. The influence function can be found to be

    $I(F,y) = \alpha K_\alpha^{-1}|y - \xi_\alpha(F)|^{\alpha-1}$  if $y > \xi_\alpha(F)$,
    $I(F,y) = -\alpha K_\alpha^{-1}|y - \xi_\alpha(F)|^{\alpha-1}$  if $y \le \xi_\alpha(F)$,

after which the usual argument is that since
$\sqrt{n}(M_{n,\alpha} - \xi_\alpha) = n^{-1/2} \sum_{i\le n} I(F, Y_i) + \varepsilon_n$, for a
suitable remainder term $\varepsilon_n$, one must have limiting normality with
$\tau^2 = \int I(F,y)^2\,dF(y)$, agreeing with (3.4). But proving that $\varepsilon_n$ here goes
to zero in probability is not trivial, since the $\xi_\alpha$ functional is rather non-smooth.
The argument can be saved via establishing Lipschitz differentiability, as in Example 1 of
Huber (1967). Our method manages to avoid these somewhat sophisticated arguments.

3D. Agnostic least squares and least absolute deviation regression. Statistical regression is
about estimating the unknown centre value of $Y$ for given $x$, i.e. the curve or surface
centre$(Y|x)$, based on $p + 1$-tuplets $(x_i, Y_i)$, where 'centre' could be the mean or the
median. Ordinary linear regression uses a linear approximation
$\beta'x = \sum_{j=1}^p \beta_jx_j$ for this centre function, which is often a very reasonable
method even if the true underlying centre function is somewhat non-linear. The least squares
regression estimator is $\hat\beta_n'x$ where $\hat\beta_n$ minimises
$\sum_{i\le n} (Y_i - \beta'x_i)^2$, and the least absolute deviation estimator is
$\hat\beta_n'x$ where $\hat\beta_n$ minimises $\sum_{i\le n} |Y_i - \beta'x_i|$.

Statistical properties of these estimators are usually investigated only under the admittedly
unlikely assumption that the true surface is linear and that the variances are constant over
the full region, i.e.

    $Y_i = \beta_0'x_i + \sigma\varepsilon_i$    (3.5)

where the $\varepsilon_i$'s are i.i.d. standardised residuals centred around zero. An in some
sense more honest approach would be to merely postulate that

    $Y_i = m(x_i) + \sigma(x_i)\varepsilon_i$,    (3.6)

for some smooth functions $m(x)$ and $\sigma(x)$, and view the regression surface estimator as
an attempt to produce a good linear approximation to the evasive $m(x)$. Our plan now is to
derive properties under robust and agnostic (3.6) conditions using Theorem 2.3 of 2C, while
assuming that the empirical distribution of $x_i$'s converges to an appropriate 'covariate
distribution' $H$. Under ideal (3.5) conditions they specialise to results obtainable using
the simpler Theorem 2.2 of 2B.

Consider least squares regression first, assuming the $\varepsilon_i$'s to have mean zero and
variance one. This fits into the 2C framework with
$g_i(Y_i, \beta|x_i) = \tfrac12(Y_i - \beta'x_i)^2$. The method aims at getting the best linear
approximation $\beta_0'x$ to $m(x)$, in the sense of minimising the limit of
$n^{-1} \sum_{i\le n} (m(x_i) - \beta'x_i)^2$. In fact this means
$\beta_0 = (\mathrm{E}XX')^{-1}\mathrm{E}XY$. We find

    $g_i(Y_i, \beta_0 + t|x_i) - g_i(Y_i, \beta_0|x_i) = -(Y_i - \beta_0'x_i)x_i't + \tfrac12(t'x_i)^2$
        $= -\{\delta(x_i) + D_i(Y_i|x_i)\}'t + \tfrac12 t'x_ix_i't$,

in which

    $\delta(x_i) = (m(x_i) - \beta_0'x_i)x_i$  and  $D_i(Y_i|x_i) = (Y_i - m(x_i))x_i$.

In the notation of (2.7) one has $A_i(x_i) = x_ix_i'$ and both remainder terms are simply equal
to zero. Consider

    $J_n = \sum_{i\le n} x_ix_i'$,  $K_n = \sum_{i\le n} \sigma(x_i)^2x_ix_i'$,  $L_n = \sum_{i\le n} \{m(x_i) - \beta_0'x_i\}^2x_ix_i'$.    (3.7)

Two results can be given, corresponding to (2.10) and (2.11). First, suppose the $x_i$ sequence
is such that $J_n/n \to$ a positive definite $J$, $K_n/n \to K$, that the $L_n/n$ sequence is
bounded, and that
$\max_{i\le n} \sigma(x_i)^2|x_i|^2 / \sum_{i\le n} \sigma(x_i)^2|x_i|^2 \to 0$. Then
$\sqrt{n}(\hat\beta_n - \beta_0)$ is asymptotically normal with mean
$J^{-1} n^{-1/2} \sum_{i\le n} (m(x_i) - \beta_0'x_i)x_i$ and variance matrix $J^{-1}KJ^{-1}$.
Secondly, under the unconditional viewpoint where the $x_i$'s are seen as i.i.d. with finite
variance matrix $L = \mathrm{E}\{m(X) - \beta_0'X\}^2XX'$ for $\delta(x_i)$, then

    $\sqrt{n}(\hat\beta_n - \beta_0) \to_d \mathcal{N}_p\{0, J^{-1}(K + L)J^{-1}\}$.    (3.8)

Note that $K + L$ can be estimated consistently with
$n^{-1} \sum_{i\le n} (Y_i - \hat\beta_n'x_i)^2x_ix_i'$.
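
The variance matrix in (3.8) and the estimate of $K + L$ just mentioned combine into the familiar 'sandwich' form, which can be sketched in a few lines of NumPy. The function name `agnostic_ls_sandwich` and the tiny data set are hypothetical, introduced only for illustration; $J$ is estimated here by $n^{-1} \sum x_ix_i'$, which is an assumption consistent with $J_n/n \to J$ in (3.7).

```python
import numpy as np

def agnostic_ls_sandwich(X, y):
    """Least squares fit and an estimate of J^{-1}(K + L)J^{-1} from (3.8)."""
    n = X.shape[0]
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    J_hat = X.T @ X / n                              # n^{-1} sum x_i x_i'
    KL_hat = (X * resid[:, None] ** 2).T @ X / n     # n^{-1} sum resid_i^2 x_i x_i'
    J_inv = np.linalg.inv(J_hat)
    return beta_hat, J_inv @ KL_hat @ J_inv

# Hypothetical data with a single covariate and no intercept.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([1.0, 2.0, 4.0])
beta_hat, V_hat = agnostic_ls_sandwich(X, y)
print(beta_hat, V_hat)
```

`V_hat` estimates the limiting variance of $\sqrt{n}(\hat\beta_n - \beta_0)$; dividing it by $n$ gives an approximate variance matrix for $\hat\beta_n$ itself.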

These results can also be derived more or less directly, i.e. without the convex machinery of 
section 2, see Exercise 45 in Hjort (1988b). In the least absolute deviation case to be reported on 
next a direct approach is much more difficult, however, but it can be efficiently handled using the 
methods of section 2. 
For the LAD regression case, take the $\varepsilon_i$'s of (3.6) to have distribution $F$ with
median zero and variance one. We will assume that $F$ has a density $f$ which further possesses
a continuous derivative $f'$. In this case $g_i(Y_i, \beta|x_i) = |Y_i - \beta'x_i|$, and the
method aims at getting the best approximation $\beta_0'x$ to $m(x)$ in the sense of minimising
the long term value of
$n^{-1} \sum_{i\le n} \mathrm{E}|m(x_i) - \beta'x_i + \sigma(x_i)\varepsilon_i|$. We skip the
various details that have to be worked through to reach a result here. They resemble those
above and arguments used in 3A. To give the result, let $D_i(Y_i|x_i) = 1$ if
$Y_i \le \beta_0'x_i$ and $-1$ if $Y_i > \beta_0'x_i$, with conditional mean
$h(x_i) = 2\Pr\{m(x_i) + \sigma(x_i)\varepsilon_i \le \beta_0'x_i\} - 1$, and consider the three
matrices

    $J_n = \sum_{i\le n} 2f_i(\beta_0'x_i - m(x_i))x_ix_i'$,  $K_n = \sum_{i\le n} \{1 - h(x_i)^2\}x_ix_i'$,  $L_n = \sum_{i\le n} h(x_i)^2x_ix_i'$,

where $f_i(z) = f(z/\sigma(x_i))/\sigma(x_i)$ is the density of the scaled residual
$\sigma(x_i)\varepsilon_i$. In particular $K_n + L_n = \sum_{i\le n} x_ix_i'$. As for the least
squares case these efforts lead to a representation

    $\sqrt{n}(\hat\beta_n - \beta_0) = -(J_n/n)^{-1}[n^{-1/2} \sum_{i\le n} h(x_i)x_i + n^{-1/2} \sum_{i\le n} \{D_i(Y_i|x_i) - h(x_i)\}x_i] + \varepsilon_n$.    (3.9)

This has one implication for given $x_i$-sequences and another implication for the 'overall
variability'. Under some mild assumptions $J_n/n \to J$ and $(K_n + L_n)/n \to K + L$, and
$\sqrt{n}(\hat\beta_n - \beta_0) \to_d \mathcal{N}_p\{0, J^{-1}(K + L)J^{-1}\}$. The $K + L$
matrix is estimated consistently using $n^{-1} \sum_{i\le n} x_ix_i'$, whereas a more
complicated consistent estimate, involving smoothing and density estimation, can be
constructed for $J$.

The special case of med$(Y|x) = m(x) = \beta_0'x$ has
$J_n = \sum_{i\le n} 2f(0)x_ix_i'/\sigma(x_i)$, and the perfect but perhaps unrealistic case of
both a linear median and a constant variance has
$J_n^{-1}(K_n + L_n)J_n^{-1} = \{4f(0)^2\}^{-1}(\sum_{i\le n} x_ix_i')^{-1}\sigma^2$. This is the
case considered in Pollard (1990).

Our method can also be applied to the quantile regression situation, where one aims to
estimate $m(x_0) + \sigma(x_0)F^{-1}(p)$, for example, to construct a prediction interval for a
future $Y$ at a given covariate value $x_0$. This time one minimises
$\sum_{i\le n} g_p(Y_i, \beta'x_i)$ with the $g_p$ function of 3B. This gives a suitable
generalisation of results reached by Bassett and Koenker (1982).

4. Maximum likelihood type estimation. 

4A. Log-concave densities. Suppose $Y_1, Y_2, \ldots$ are i.i.d. from some continuous density
$f$, and that a parametric model of the form $f(y,\theta) = f(y,\theta_1, \ldots, \theta_p)$ is
employed, where the parameter space is some open and convex region. We stipulate that
$\log f(y,\theta)$ be concave in $\theta$ in this region and shall be able to reprove familiar
results on maximum likelihood (ML) and Bayes estimation, using the convexity based results of
section 2, but with milder smoothness assumptions than those traditionally employed.

Note that the log-likelihood $\sum_{i\le n} \log f(Y_i,\theta)$ when divided by $n$ tends to
$\mathrm{E}\log f(Y,\theta) = \int f(y)\log f(y,\theta)\,dy$, for each $\theta$. Assume that this
function has a unique global maximum at $\theta_0$, which is the 'agnostic parameter value' that
gives best approximation according to the Kullback-Leibler distance
$\int f(y)\log\{f(y)/f(y,\theta)\}\,dy$ from truth to approximating density. From section 2A the
following result is quite immediate.

Theorem 4.1. Suppose $\log f(y,\theta_0 + t) - \log f(y,\theta_0) = D(y)'t + R(y,t)$ is concave
in $t$, for a $D(.)$ function with mean zero and finite covariance matrix $K$ under $f$, and
that the remainder term satisfies

    $\mathrm{E}\{\log f(Y,\theta_0 + t) - \log f(Y,\theta_0)\} = \mathrm{E}\,R(Y,t) = -\tfrac12 t'Jt + o(|t|^2)$    (4.1)
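as well as $\mathrm{Var}\,R(Y_i,t) = o(|t|^2)$, where $J$ is symmetric and positive definite.
Then the maximum likelihood estimator $\hat\theta_n$ is $\sqrt{n}$-consistent for $\theta_0$ and

    $\sqrt{n}(\hat\theta_n - \theta_0) = J^{-1} n^{-1/2} \sum_{i\le n} D(Y_i) + o_p(1) \to_d \mathcal{N}_p\{0, J^{-1}KJ^{-1}\}$.

In ordinary smooth cases one can Taylor expand and use
$D(y) = \partial\log f(y,\theta_0)/\partial\theta$ and find a remainder $R(y,t)$ with mean
$-\tfrac12 t'Jt + o(|t|^2)$ and squared mean of order $O(|t|^4)$, involving

    $J = -\mathrm{E}_f\,\partial^2\log f(Y,\theta_0)/\partial\theta\,\partial\theta'$  and  $K = \mathrm{VAR}_f\,\partial\log f(Y,\theta_0)/\partial\theta$.    (4.2)

Notice that when the model happens to be perfect, as in textbooks for optimistic statisticians,
then $K = J$, and we get the more familiar $\mathcal{N}_p\{0, J^{-1}\}$ result.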

Example. In addition to the median $M_n$ in the situation of 3A, look at the mean absolute
deviation statistic $\hat\tau_n = n^{-1} \sum_{i\le n} |Y_i - M_n|$. We will show simultaneous
convergence of $\sqrt{n}(M_n - \mu, \hat\tau_n - \tau)$, where $\tau = \mathrm{E}|Y_i - \mu|$,
and for this assume finite variance of the $Y_i$'s.

This can be accomplished by considering the parametric model
$f(y,\mu,\tau) = (2\tau)^{-1}\exp\{-|y - \mu|/\tau\}$ for data. This model may be quite
inadequate to describe the behaviour of the data sequence, but the ML estimates are
nevertheless $M_n$ and $\hat\tau_n$ as above. The traditional theorems on ML behaviour require
more smoothness than is present here, and indeed often require that the true $f$ belongs to the
model, but Theorem 4.1 can be used. This is because $\log f(y,\mu,\tau)$ is concave in
$(\mu/\tau, 1/\tau)$. Verifying conditions involves details similar to those in 3A, and we omit
them here. The

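result is

    $\begin{pmatrix} \sqrt{n}(M_n - \mu) \\ \sqrt{n}(\hat\tau_n - \tau) \end{pmatrix} \to_d \mathcal{N}_2\left\{ \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1/\{4f(\mu)^2\} & \mathrm{cov} \\ \mathrm{cov} & \mathrm{Var}\,Y_i - \tau^2 \end{pmatrix} \right\}$,

where the covariance is $\mathrm{E}\,I\{Y_i \le \mu\}|Y_i - \mu| - \tfrac12\tau$. Note that there
is asymptotic independence if $f$ is symmetric around $\mu$.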

4B. Bayes and maximum likelihood estimators are asymptotically equivalent. It is well known
that Bayes and ML estimation are asymptotically equivalent procedures in regular situations. In
other words, if $\theta_n^*$ is the Bayes estimator under some prior $\pi(\theta)$, then
$\sqrt{n}(\theta_n^* - \theta_0)$ has the same limit distribution as
$\sqrt{n}(\hat\theta_n - \theta_0)$. The standard proofs of this fact involve many
technicalities, and furthermore are typically restricted to calculations under the assumption
that the underlying $f(y,\theta_0)$ model is exactly correct, see e.g. Lehmann (1983, chapter
6.7). Below follows a reasonably quick proof of this fact, and it is reassuring that the result
is valid also outside model circumstances.

Let $\pi(\theta)$ be a prior density, assumed continuous at $\theta_0$ and satisfying the growth
constraint

    $\pi(\theta) \le C_1\exp(C_2|\theta|)$  for all $\theta$,

where $C_1$ and $C_2$ are positive constants. The posterior density is proportional to
$L_n(\theta)\pi(\theta)$, where $L_n(\theta) = \prod_{i\le n} f(Y_i,\theta)$ is the likelihood.
The Bayes estimator $\theta_n^*$ (under quadratic loss) is the posterior mean. Note that
improper priors are accepted too.

We shall make use of the following dominated convergence fact, which is a special case of
Lemma A3 in the appendix. Suppose $\{G_n(s,\omega)\}$ is a sequence of random functions (assumed
jointly measurable) such that $G_n(s,\omega) \to G(s)$ in probability, for each $s$. Suppose
$H(s)$ is an integrable function for which the set
$\{\omega: |G_n(s,\omega)| \le H(s) \text{ for all } s\}$ has probability tending to one. Then
$\int G_n(s,\omega)\,ds \to \int G(s)\,ds$ in probability. (Apply Lemma A3 with $X_n$ equal to
$G_n$ restricted to the set where $|G_n| \le H$.)
Theorem 4.2. Under the conditions of Theorem 4.1, the ML estimator $\hat\theta_n$ and the
posterior mean $\theta_n^*$ are asymptotically equivalent, in the sense that
$\sqrt{n}(\theta_n^* - \hat\theta_n) \to_p 0$.

Proof: Define the random convex function $A_n(s)$ by

    $\exp(-A_n(s)) = L_n(\hat\theta_n + s/\sqrt{n})/L_n(\hat\theta_n)$.

By definition of the ML estimator, $A_n$ achieves its minimum value of zero at $s = 0$. By the
change of variables $\theta = \hat\theta_n + s/\sqrt{n}$ we find

    $\theta_n^* = \dfrac{\int \theta L_n(\theta)\pi(\theta)\,d\theta}{\int L_n(\theta)\pi(\theta)\,d\theta} = \hat\theta_n + \dfrac{1}{\sqrt{n}}\,\dfrac{\int s\exp(-A_n(s))\,\pi(\hat\theta_n + s/\sqrt{n})\exp(-C_2|\hat\theta_n|)\,ds}{\int \exp(-A_n(s))\,\pi(\hat\theta_n + s/\sqrt{n})\exp(-C_2|\hat\theta_n|)\,ds}$.

The random function $A_n$ converges in probability uniformly on compact sets to
$\tfrac12 s'Js$. Define $\gamma_n = \inf_{|t|=1} A_n(t)$. It converges in probability to
$\gamma_0 = \inf_{|t|=1} \tfrac12 t'Jt > 0$. Argue as in Lemma 2 to show that
$A_n(s) \ge \gamma_n|s|$ for $|s| > 1$. The domination condition needed for the fact noted above
holds in both numerator and denominator with

    $H(s) = 2C_1$  if $|s| \le 1$,
    $H(s) = C_1|s|\exp(-\tfrac12\gamma_0|s|)$  if $|s| > 1$.

The ratio of integrals converges in probability to

    $\dfrac{\int s\exp(-\tfrac12 s'Js)\,\pi(\theta_0)\exp(-C_2|\theta_0|)\,ds}{\int \exp(-\tfrac12 s'Js)\,\pi(\theta_0)\exp(-C_2|\theta_0|)\,ds} = 0$.

The result follows. []

5. Logistic regression. Suppose that $p + 1$-tuplets $(x_i, Y_i)$ are observed, where
$x_i = (x_{i,1}, \ldots, x_{i,p})'$ is a covariate vector 'explaining' the binomial outcome
$Y_i$. The logistic regression model postulates that the $Y_i$'s are independent with

    $\Pr\{Y_i = 1|x_i\} = q(x_i,\beta) = \dfrac{\exp(\beta'x_i)}{1 + \exp(\beta'x_i)}$  for some $\beta = \beta_0$,    (5.1)

and the ML estimator $\hat\beta_n = (\hat\beta_{n,1}, \ldots, \hat\beta_{n,p})'$ maximises the
log-likelihood function

    $\sum_{i\le n} [Y_i\log q(x_i,\beta) + (1 - Y_i)\log\{1 - q(x_i,\beta)\}] = \sum_{i\le n} [Y_i\beta'x_i - \log\{1 + \exp(\beta'x_i)\}]$.

Of course the asymptotic normality of this estimator is well known and widely used, but precise
sufficient conditions are not easy to find in the literature.

We will soon arrive at such, employing results of 2B, which are applicable since the summands
above are concave in $\beta$. As a preparatory exercise we mark down the following little expansion,
which holds for all $u$ and $u + h$, in terms of $\pi(u) = \exp(u)/\{1 + \exp(u)\}$:

$$\log \frac{1 + \exp(u + h)}{1 + \exp(u)} = \pi(u)h + \tfrac12 \pi(u)\{1 - \pi(u)\}h^2 + \tfrac16 \pi(u)\{1 - \pi(u)\}\gamma(u, h)h^3, \tag{5.2}$$

where $|\gamma(u, h)| \le \exp(|h|)$. This is proved from the exact third order Taylor expansion expression,
with third term equal to $\tfrac16 \pi(u')\{1 - \pi(u')\}\{1 - 2\pi(u')\}h^3$, for appropriate $u'$ between $u$ and $u + h$.
Some analysis reveals that $\pi(u')\{1 - \pi(u')\} \le \exp(|h|)\,\pi(u)\{1 - \pi(u)\}$, regardless of $u$ and $h$. This
is in fact quite similar to what results from using Lemma A2 in the appendix, but the bound on
the remainder obtained here suits the problem better.
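
As a numerical sanity check of expansion (5.2), the following sketch solves (5.2) for $\gamma(u, h)$ on a small grid and compares it with the bound $\exp(|h|)$. The grid and the helper names are illustrative assumptions, and the third-order coefficient is taken as 1/6, as the exact Taylor expansion dictates.

```python
import math

def pi(u: float) -> float:
    """Logistic function pi(u) = exp(u) / (1 + exp(u))."""
    return 1.0 / (1.0 + math.exp(-u))

def log1pexp(u: float) -> float:
    """log(1 + exp(u)), adequate for the moderate arguments used here."""
    return math.log1p(math.exp(u))

def gamma(u: float, h: float) -> float:
    """Solve (5.2) for gamma(u, h); h must be nonzero."""
    p = pi(u)
    w = p * (1.0 - p)
    remainder = log1pexp(u + h) - log1pexp(u) - p * h - 0.5 * w * h * h
    return remainder / (w * h ** 3 / 6.0)

worst = 0.0
for iu in range(-10, 11):
    for ih in range(-6, 7):
        if ih == 0:
            continue
        u, h = 0.5 * iu, 0.5 * ih
        worst = max(worst, abs(gamma(u, h)) / math.exp(abs(h)))
print(worst)  # largest observed value of |gamma(u, h)| / exp(|h|)
```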

5A. Under model conditions. In the spirit of our two aims, laid out in the Introduction, we will
first give a simpler result with a 'pedagogical proof', and then sharpen the tools to reach a second
result with minimal regularity conditions. Under model conditions (5.1), write for convenience
$q_i = q(x_i, \beta_0)$, and let $J_n = \sum_{i \le n} q_i(1 - q_i)x_ix_i'$ be the information matrix.

Theorem 5.1. Assume that $\mu_n = \max_{i \le n} |x_i|/\sqrt{n} \to 0$ and that $J_n/n \to J$. Then, under
model conditions (5.1), $\sqrt{n}(\hat\beta_n - \beta_0) \to_d \mathcal{N}_p\{0, J^{-1}\}$.

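Proof: We will use Theorem 2.2 with $g_i(y_i, \beta) = \log f_i(y_i, \beta) = y_i\beta'x_i - \log\{1 + \exp(\beta'x_i)\}$.
The expansion noted above yields

$$\begin{aligned}
\log \frac{f_i(y_i, \beta_0 + t)}{f_i(y_i, \beta_0)} &= y_i t'x_i - \bigl[\log\{1 + \exp(\beta_0'x_i + t'x_i)\} - \log\{1 + \exp(\beta_0'x_i)\}\bigr] \\
&= (y_i - q_i)x_i't - \tfrac12 q_i(1 - q_i)(t'x_i)^2 - \tfrac16 q_i(1 - q_i)\gamma_i(t)(t'x_i)^3 \\
&= D_i(y_i)'t - R_i(y_i, t).
\end{aligned}$$

Here $D_i(y_i) = (y_i - q_i)x_i$ and $R_i(y_i, t) = \tfrac12 t'q_i(1 - q_i)x_ix_i't + v_{i,0}(t)$, where $|\gamma_i(t)| \le \exp(|t'x_i|)$ in the
expression for $v_{i,0}(t)$. Note that $J_n = K_n$, in the notation of Theorem 2.2, and that $R_i(Y_i, t)$ has
zero variance, so what we have to prove is (i) that $\sum_{i \le n} v_{i,0}(s/\sqrt{n}) \to 0$; (ii) that the Lindeberg
conditions are satisfied for $\sum_{i \le n} n^{-1/2}(Y_i - q_i)x_i$. But

$$\begin{aligned}
\Bigl|\sum_{i \le n} v_{i,0}(s/\sqrt{n})\Bigr| &\le \sum_{i \le n} \tfrac16 q_i(1 - q_i)\exp(|s'x_i/\sqrt{n}|)\,|s'x_i/\sqrt{n}|^3 \\
&\le \sum_{i \le n} \tfrac16 q_i(1 - q_i)\exp(|s|\mu_n)\,(s'x_ix_i's/n)\,|s|\mu_n \\
&= \tfrac16 |s|\mu_n \exp(|s|\mu_n)\, s'(J_n/n)s,
\end{aligned}$$

which goes to zero. And the Lindeberg condition is that for each $s$ and $\delta$

$$\sum_{i \le n} E\, n^{-1}(Y_i - q_i)^2(s'x_i)^2\, I\{n^{-1/2}|(Y_i - q_i)s'x_i| \ge \delta\} \to 0,$$

and this sum is bounded by $s'(J_n/n)s\, I\{|s|\mu_n \ge \delta\}$. This ends the proof. []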

If the $x_i$'s are i.i.d. from some covariate distribution $H$, then $\mu_n \to 0$ a.s. exactly when the
components of $x_i$ have finite second moment. This also secures convergence of $J_n/n$ to $J =
\int q(x, \beta_0)\{1 - q(x, \beta_0)\}xx'\,H(dx)$.
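
A small simulation can make the setting of Theorem 5.1 concrete. The sketch below assumes a scalar covariate drawn from a standard normal distribution and a hypothetical true value $\beta_0 = 0.5$; it fits the ML estimator by Newton-Raphson (a standard choice for this concave problem, not a method prescribed by the paper) and reports the standardised error together with $\mu_n$.

```python
import math
import random

def logistic(u: float) -> float:
    return 1.0 / (1.0 + math.exp(-u))

def fit_logistic_1d(xs, ys, iters: int = 25) -> float:
    """Newton-Raphson for the concave log-likelihood sum(y*b*x - log(1 + exp(b*x)))."""
    b = 0.0
    for _ in range(iters):
        score = 0.0
        info = 0.0
        for x, y in zip(xs, ys):
            q = logistic(b * x)
            score += (y - q) * x
            info += q * (1.0 - q) * x * x
        b += score / info
    return b

rng = random.Random(1)
beta0, n = 0.5, 2000                       # hypothetical true parameter and sample size
xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
ys = [1 if rng.random() < logistic(beta0 * x) else 0 for x in xs]

beta_hat = fit_logistic_1d(xs, ys)
J_n = sum(logistic(beta0 * x) * (1.0 - logistic(beta0 * x)) * x * x for x in xs)
mu_n = max(abs(x) for x in xs) / math.sqrt(n)
print(beta_hat, math.sqrt(J_n) * (beta_hat - beta0), mu_n)
```

Repeating the simulation over many seeds and inspecting the standardised errors would give an empirical view of the normal limit.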

Our second and sharper theorem is proved next, by squeezing more out of the bound of the
$v_{i,0}(t)$ remainder and more out of the Lindeberg condition.

Theorem 5.2. Assume that the $\lambda_n = \max_{i \le n} |J_n^{-1/2}x_i|$ sequence is bounded, and that

$$N_n(\delta) = \sum_{i \le n} q_i(1 - q_i)\, x_i'J_n^{-1}x_i\, I\{|J_n^{-1/2}x_i| \ge \delta\} \to 0 \quad \text{for each positive } \delta. \tag{5.3}$$

Then, under model conditions (5.1), $J_n^{1/2}(\hat\beta_n - \beta_0) \to_d \mathcal{N}_p\{0, I_p\}$.

Proof: We consider the random concave function $\sum_{i \le n} \{\log f_i(Y_i, \beta_0 + J_n^{-1/2}s) - \log f_i(Y_i, \beta_0)\}$,
which upon using the expansion again can be rearranged as $U_n's - \tfrac12 s's - r_n(s)$, where $U_n =
J_n^{-1/2}\sum_{i \le n}(Y_i - q_i)x_i$ and $r_n(s) = \sum_{i \le n} \tfrac16 q_i(1 - q_i)\gamma_i(J_n^{-1/2}s)(s'J_n^{-1/2}x_i)^3$. We are to prove (i)
that $r_n(s) \to 0$, and (ii) that $U_n \to_d \mathcal{N}_p\{0, I_p\}$.

At this stage we call on appendix A1 where it is shown that (5.3) is a sufficient and actually also
a necessary condition for (ii) to hold. And $|r_n(s)|$ is bounded by $\sum_{i \le n} \tfrac16 q_i(1 - q_i)\exp(|s'J_n^{-1/2}x_i|)\,|s'J_n^{-1/2}x_i|^3$.
We split this sum into $|J_n^{-1/2}x_i| \le \delta$ summands and $|J_n^{-1/2}x_i| > \delta$ summands. The
first sum is bounded by $\tfrac16 |s|^3\delta\exp(|s|\delta)$, and the second is bounded by $\tfrac16 |s|^3\lambda_n\exp(|s|\lambda_n)\,N_n(\delta)$.
Letting $n \to \infty$ and $\delta \to 0$ afterwards shows that indeed $r_n(s) \to 0$. []

It is worth noting that the $N_n(\delta) \to 0$ condition in the theorem serves two purposes: forcing
an analytic remainder term towards zero, and securing uniform negligibility of individual terms in
the large-sample distribution of $J_n^{-1/2}\sum_{i \le n} D_i(Y_i)$, i.e. a normal limit. Note also that $\lambda_n \to 0$
suffices for the conclusion to hold, since $N_n(\delta) \le p\lambda_n/\delta$.
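
The two quantities in Theorem 5.2 are easy to compute for a given design. The following sketch does so for $p = 2$, using the identity $|J_n^{-1/2}x_i|^2 = x_i'J_n^{-1}x_i$ so that no matrix square root is needed. The design points and the value of $\beta_0$ are hypothetical.

```python
import math

def quad_form_inv(J, x):
    """x' J^{-1} x for a symmetric positive definite 2x2 matrix J."""
    (a, b), (_, d) = J
    det = a * d - b * b
    return (d * x[0] ** 2 - 2.0 * b * x[0] * x[1] + a * x[1] ** 2) / det

def lambda_and_N(xs, qs, delta):
    """Return (lambda_n, N_n(delta)) as defined in Theorem 5.2 and (5.3)."""
    J = [[0.0, 0.0], [0.0, 0.0]]
    for x, q in zip(xs, qs):
        w = q * (1.0 - q)
        for r in range(2):
            for c in range(2):
                J[r][c] += w * x[r] * x[c]
    norms2 = [quad_form_inv(J, x) for x in xs]        # |J_n^{-1/2} x_i|^2
    lam = math.sqrt(max(norms2))
    N = sum(q * (1.0 - q) * m for q, m in zip(qs, norms2) if math.sqrt(m) >= delta)
    return lam, N

beta0 = (0.3, -0.7)                                   # hypothetical true parameter
xs = [(1.0, math.sin(i)) for i in range(1, 401)]      # hypothetical design points
qs = [1.0 / (1.0 + math.exp(-(beta0[0] * x[0] + beta0[1] * x[1]))) for x in xs]

delta = 0.05
lam, N = lambda_and_N(xs, qs, delta)
print(lam, N, 2 * lam / delta)    # N_n(delta) should not exceed p * lambda_n / delta
```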

5B. Outside model conditions. Let us next depart from the strict model assumption (5.1), which
in most cases merely is intended to provide a reasonable approximation to some more complicated
reality, and stipulate only that $\Pr\{Y = 1 \mid x\} = q(x)$ for some true, underlying $q(x)$ function.
Fitting the logistic regression equation makes sense still, and turns out to aim at achieving the best
approximation $q(x, \beta)$ to the true $q(x)$, in a sense made precise as follows. Let

$$\Delta_x[q(x), q(x, \beta)] = q(x)\log\frac{q(x)}{q(x, \beta)} + \{1 - q(x)\}\log\frac{1 - q(x)}{1 - q(x, \beta)}$$

be the Kullback-Leibler distance from true binomial $(1, q(x))$ to modelled binomial $(1, q(x, \beta))$, and
let $\Delta[q(\cdot), q(\cdot, \beta)] = \int \Delta_x[q(x), q(x, \beta)]\,H(dx)$ be the weighted distance between the true probability
curve and the modelled probability curve, in which $H$ again is the 'covariate distribution' for $x$'s,
as discussed in 2C. The following can now be proved using methods of 2C: the ML estimator is
$\sqrt{n}$-consistent for the value $\beta_0$ that minimises the weighted Kullback-Leibler distance $\Delta$, and
$\sqrt{n}(\hat\beta_n - \beta_0) \to_d \mathcal{N}_p\{0, J^{-1}KJ^{-1}\}$, provided the two matrices

$$J = E\,XX'q(X, \beta_0)\{1 - q(X, \beta_0)\} = \int xx'\,q(x, \beta_0)\{1 - q(x, \beta_0)\}\,H(dx),$$

$$K = E\,XX'\{Y - q(X, \beta_0)\}^2 = \int xx'\bigl[q(x)\{1 - q(x)\} + \{q(x) - q(x, \beta_0)\}^2\bigr]\,H(dx)$$

are finite. This result was also obtained in Hjort (1988a), where various implications for statistical
inference also are discussed.
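
The second expression for $K$ comes from a conditional bias-variance decomposition, which may be worth spelling out. The display below is a short worked step using only the definitions above; it also records the consequence that the sandwich matrix collapses to $J^{-1}$ when the model (5.1) happens to be correct.

```latex
% Conditional on X = x, Y is Bernoulli with success probability q(x), so
\begin{aligned}
E\bigl[\{Y - q(x,\beta_0)\}^2 \mid X = x\bigr]
  &= \operatorname{Var}(Y \mid x) + \{E(Y \mid x) - q(x,\beta_0)\}^2 \\
  &= q(x)\{1 - q(x)\} + \{q(x) - q(x,\beta_0)\}^2 .
\end{aligned}
% If q(x) = q(x,\beta_0) for all x, the squared bias term vanishes and
% K = \int xx'\, q(x,\beta_0)\{1 - q(x,\beta_0)\}\, H(dx) = J,
% hence J^{-1} K J^{-1} = J^{-1}, in agreement with Theorem 5.1.
```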

6. Cox regression. In this section new proofs are presented for the consistency and asymptotic
normality of the usual estimators in Cox's famous semiparametric regression model for survival
analysis data. The parametric Cox regression model is somewhat simpler, and is treated in 7A below.
The regularity requirements we need turn out in both cases to be weaker than those earlier
presented in the literature.

The most complete results and proofs in the literature for the basic large-sample properties
of the estimators in this model are perhaps those of Andersen and Gill (1982) and Hjort (1992).
Andersen and Gill obtain results under the conditions of the model, and with regularity conditions
considerably weaker than earlier i.i.d. type assumptions, whereas Hjort explores the large-sample behaviour
also outside the conditions of the model. For a history of the Cox model and the various approaches
to reach asymptotic results, see Andersen, Borgan, Gill & Keiding (1992, chapter VII).

Our present intention is to provide yet another proof, which in several ways is simpler and
requires less involvement with the martingale techniques than the one of Andersen and Gill. As in
the previous section we choose to present two theorems, reflecting our two aims explained in section
1. The first holds when the covariates are bounded, in which case the proof is quite transparent,
and extra regularity conditions can be kept quite minimal. The second version is more sophisticated
in that it tolerates unbounded covariates and weakens regularity conditions further.

The usual Cox regression model for possibly censored lifetimes with covariate information is
as follows: The individuals have independent lifetimes $T_1^0, \ldots, T_n^0$, and the $i$-th has hazard rate

$$\lambda_i(s) = \lambda(s)\exp(\beta'z_i(s)) = \lambda(s)\exp(\beta_1z_{i,1}(s) + \cdots + \beta_pz_{i,p}(s)), \tag{6.1}$$

depending on that person's covariate vector $z_i(s)$, and involving some unspecified basis hazard
rate $\lambda(s)$. As indicated the covariates are allowed to depend on time $s$, and they can be random
processes, as long as they are previsible; $z_i(s)$ should only depend on information available at time
$s-$ (for a full discussion of previsibility, or predictability, see Andersen et al. (1992, p. 65-66)). There
is a possibly interfering censoring time $C_i$ leaving just $T_i = \min\{T_i^0, C_i\}$ and $\delta_i = I\{T_i^0 \le C_i\}$ to the
statistician. Consider the at risk indicator function $Y_i(s) = I\{T_i \ge s\}$, which is left continuous and
hence previsible, and the counting process $N_i$ with mass $\delta_i$ at $T_i$, i.e. $dN_i(s) = I\{T_i \in [s, s+ds], \delta_i =
1\}$. The log partial likelihood can then be written

$$G_n(\beta) = \sum_{i \le n}\int_0^L \{\beta'z_i(s) - \log R_n(s, \beta)\}\,dN_i(s), \tag{6.2}$$

featuring the empirical relative risk function $R_n(s, \beta) = \sum_{i \le n} Y_i(s)\exp(\beta'z_i(s))$, see for example
Andersen et al. (1992, chapter VII). It is assumed that data are collected on the finite time interval
$[0, L]$ only. The Cox estimator is the value $\hat\beta_n$ that maximises the partial likelihood.
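
For readers who want to see (6.2) in computational form, here is a minimal sketch under simplifying assumptions that are not part of the paper's general setting: a scalar, time-constant covariate and no tied event times. The toy data are hypothetical. The last lines check concavity at a midpoint, which is the property the method relies on.

```python
import math

def log_partial_likelihood(beta, times, events, z):
    """G_n(beta) of (6.2) for scalar, time-constant covariates and untied event times."""
    total = 0.0
    for t_i, d_i, z_i in zip(times, events, z):
        if d_i == 1:
            # R_n(T_i, beta): sum over individuals at risk, Y_j(T_i) = I{T_j >= T_i}
            risk = sum(math.exp(beta * z_j) for t_j, z_j in zip(times, z) if t_j >= t_i)
            total += beta * z_i - math.log(risk)
    return total

# Hypothetical toy data: observed times, event indicators (1 = event), covariates.
times = [2.1, 3.4, 0.7, 5.0, 1.9, 4.2]
events = [1, 0, 1, 1, 1, 0]
z = [0.5, -1.0, 1.2, -0.3, 0.0, 0.8]

b1, b2 = -1.0, 2.0
mid = log_partial_likelihood(0.5 * (b1 + b2), times, events, z)
avg = 0.5 * (log_partial_likelihood(b1, times, events, z)
             + log_partial_likelihood(b2, times, events, z))
print(mid, avg)   # concavity: the value at the midpoint is at least the average
```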

Lemma A2 of the appendix allows us an expansion for $\log R_n(s, \beta_0 + x)$, using $w_i = Y_i(s)\exp(\beta_0'z_i(s))$
and $a_i = z_i(s)'x$. The result is

$$\log R_n(s, \beta_0 + x) - \log R_n(s, \beta_0) = \bar z_n(s)'x + \tfrac12 x'V_n(s)x + v_n(x, s), \tag{6.3}$$

where

$$\bar z_n(s) = \sum_{i \le n} p_{n,i}(s)z_i(s) \quad \text{and} \quad V_n(s) = \sum_{i \le n} p_{n,i}(s)(z_i(s) - \bar z_n(s))(z_i(s) - \bar z_n(s))', \tag{6.4}$$

and $p_{n,i}(s) = Y_i(s)\exp(\beta_0'z_i(s))/R_n(s, \beta_0)$. A bound for the remainder term in (6.3) is $|v_n(x, s)| \le
\tfrac43 \max_{i \le n} |(z_i(s) - \bar z_n(s))'x|^3$. Observe that $\bar z_n(s)$ and $V_n(s)$ can be interpreted as the mean value
and the variance matrix for $z_i(s)$, where this covariate vector is randomly selected among those at
risk at time $s$ with probabilities proportional to the relative risks $\exp(\beta_0'z_i(s))$.
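
The interpretation of $\bar z_n(s)$ and $V_n(s)$ as a weighted mean and variance, and the remainder bound in (6.3), can be checked numerically at a single time point. The sketch below assumes scalar covariates and hypothetical values for $z_i(s)$, $Y_i(s)$, $\beta_0$ and $x$; the maximum in the bound is taken over individuals at risk only, which is harmless since the others carry zero weight.

```python
import math

def tilted_moments(z, at_risk, beta0):
    """Return (z_bar, V) of (6.4) for scalar covariates at one fixed time s."""
    w = [y * math.exp(beta0 * zi) for zi, y in zip(z, at_risk)]
    R = sum(w)
    p = [wi / R for wi in w]                      # the probabilities p_{n,i}(s)
    z_bar = sum(p_i * zi for p_i, zi in zip(p, z))
    V = sum(p_i * (zi - z_bar) ** 2 for p_i, zi in zip(p, z))
    return z_bar, V

def log_R(z, at_risk, beta):
    """log R_n(s, beta) at the same fixed time s."""
    return math.log(sum(y * math.exp(beta * zi) for zi, y in zip(z, at_risk)))

z = [0.5, -1.0, 1.2, -0.3, 0.0, 0.8]      # hypothetical covariate values z_i(s)
at_risk = [1, 1, 0, 1, 1, 1]              # hypothetical at-risk indicators Y_i(s)
beta0, x = 0.4, 0.25

z_bar, V = tilted_moments(z, at_risk, beta0)
remainder = (log_R(z, at_risk, beta0 + x) - log_R(z, at_risk, beta0)
             - z_bar * x - 0.5 * V * x * x)
bound = (4.0 / 3.0) * max(abs((zi - z_bar) * x) for zi, y in zip(z, at_risk) if y) ** 3
print(z_bar, V, remainder, bound)
```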
All this leaves us suitably prepared for a theorem. 

Theorem 6.1. Assume that the hazard rate for the $i$'th individual follows the Cox model
(6.1) with a true parameter $\beta_0$ and a continuous positive basis hazard $\lambda(s)$, and that the covariate
processes $z_i(s)$ are previsible and uniformly bounded. Assume that

$$J_n(s) = n^{-1}\sum_{i \le n} Y_i(s)\exp(\beta_0'z_i(s))\,(z_i(s) - \bar z_n(s))(z_i(s) - \bar z_n(s))' \to_p J(s) \tag{6.5}$$

for almost all $s$ in $[0, L]$ and that $J = \int_0^L J(s)\lambda(s)\,ds$ is positive definite. Then $\hat\beta_n$ is $\sqrt{n}$-consistent
for $\beta_0$ and $\sqrt{n}(\hat\beta_n - \beta_0) \to_d \mathcal{N}_p\{0, J^{-1}\}$.

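Proof: As a simple consequence of earlier efforts we have

$$\begin{aligned}
G_n^*(x) &= G_n(\beta_0 + x/\sqrt{n}) - G_n(\beta_0) \\
&= \sum_{i \le n}\int_0^L \bigl[n^{-1/2}(z_i(s) - \bar z_n(s))'x - \tfrac12 n^{-1}x'V_n(s)x - v_n(x/\sqrt{n}, s)\bigr]\,dN_i(s) \\
&= U_n'x - \tfrac12 x'J_n^*x - r_n(x),
\end{aligned} \tag{6.6}$$

where we write

$$U_n = n^{-1/2}\sum_{i \le n}\int_0^L (z_i(s) - \bar z_n(s))\,dN_i(s) \quad \text{and} \quad J_n^* = n^{-1}\int_0^L V_n(s)\,d\bar N_n(s), \tag{6.7}$$

using $\bar N_n(\cdot) = \sum_{i \le n} N_i(\cdot)$ to denote the aggregated counting process for the data. The designated
remainder term is $r_n(x) = \int_0^L v_n(x/\sqrt{n}, s)\,d\bar N_n(s)$, which goes to zero, since it is bounded by
$\int_0^L \tfrac43 (2K)^3|x|^3/n^{3/2}\,d\bar N_n(s)$, which is $O(n^{-1/2})$. The $K$ here is the absolute bound on the covariates.
That the (6.6) function is concave in $x$ is clear from the convexity of $\log R_n(s, \beta)$ in $\beta$. By the basic
method of section 1 it only remains to show (i) that $J_n^* \to_p J$ and (ii) that $U_n \to_d \mathcal{N}_p\{0, J\}$.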

At this stage we need some of the easier bits of the martingale representation and convergence
theory for counting processes, but manage to avoid needing some of the more sophisticated
inequalities and technicalities that have invariably been present in earlier rigorous proofs,
like in Andersen and Gill (1982). The counting process $N_i$ has compensator process $A_i(t) =
\int_0^t Y_i(s)\exp(\beta_0'z_i(s))\,d\Lambda(s)$, writing $d\Lambda(s) = \lambda(s)\,ds$. This means that $M_i(t) = N_i(t) - A_i(t)$ is
a zero mean martingale, with increments $dM_i(s) = dN_i(s) - Y_i(s)\exp(\beta_0'z_i(s))\,d\Lambda(s)$. One can
show that $M_i(t)^2 - A_i(t)$ as well as $M_i(t)M_j(t)$ are martingales too, when $i \ne j$, which in martingale
theory parlance means that $M_i$ has variance process $\langle M_i, M_i\rangle(t) = A_i(t)$ and that they
are orthogonal, i.e. $\langle M_i, M_j\rangle = 0$. See Andersen et al. (1992, chapter II), for example. Inserting
$dN_i(s) = dM_i(s) + dA_i(s)$ in (6.7) leads to

$$J_n^* = \int_0^L J_n(s)\,d\Lambda(s) + n^{-1}\sum_{i \le n}\int_0^L V_n(s)\,dM_i(s) \tag{6.8}$$

and

$$U_n = n^{-1/2}\sum_{i \le n}\int_0^L (z_i(s) - \bar z_n(s))\,dM_i(s), \tag{6.9}$$

in that two other terms cancel.

We are now in a position to prove (i) and (ii). Note that the first term of (6.8) goes to $J$ in
probability by boundedness of the integrand and Lemma A3 in the appendix. The second term is
$O_p(n^{-1/2})$, which can be seen using boundedness of covariates in conjunction with the result

$$E\Bigl\{\sum_{i \le n}\int_0^L H_i(s)\,dM_i(s)\Bigr\}^2 = E\sum_{i \le n}\int_0^L H_i(s)^2\,d\langle M_i, M_i\rangle(s),$$

valid for previsible random functions $H_i$. This proves (i). To prove convergence in distribution of
$U_n$ we essentially use the version of Rebolledo's martingale central limit theorem given in Andersen
and Gill (1982, appendix I). Its variance process converges properly,

$$\langle U_n, U_n\rangle(L) = n^{-1}\sum_{i \le n}\int_0^L (z_i(s) - \bar z_n(s))(z_i(s) - \bar z_n(s))'\,d\langle M_i, M_i\rangle(s) = \int_0^L J_n(s)\,d\Lambda(s) \to_p J,$$

and the necessary Lindeberg type condition is also satisfied:

$$n^{-1}\sum_{i \le n}\int_0^L |z_i(s) - \bar z_n(s)|^2\, I\{n^{-1/2}|z_i(s) - \bar z_n(s)| \ge \varepsilon\}\, Y_i(s)\exp(\beta_0'z_i(s))\,d\Lambda(s) \to_p 0 \tag{6.10}$$

since the indicator function ends up being zero for all large $n$. []

Next we present a stronger theorem with weaker conditions imposed. The proof is basically 
the same as for the previous result, but more is squeezed out of bounds for remainder terms and 
out of conditions for the martingale convergence to hold. 

Theorem 6.2. Assume that the hazard rate for the $i$'th individual follows the Cox model
(6.1) with a true parameter $\beta_0$ and a continuous positive basis hazard $\lambda(s)$. Assume that $J_n(s)$
goes to some $J(s)$ in probability for almost all $s$, as in (6.5), and that $\int_0^L J_n(s)\lambda(s)\,ds \to_p J =
\int_0^L J(s)\lambda(s)\,ds$, a positive definite matrix. Suppose finally that

$$\mu_n(s) = n^{-1/2}\max_{i \le n} |z_i(s) - \bar z_n(s)| \to_p 0 \quad \text{for almost each } s \tag{6.11}$$

and that $\max_{s \le L} \mu_n(s)$ is stochastically bounded. Then again $\sqrt{n}(\hat\beta_n - \beta_0) \to_d \mathcal{N}_p\{0, J^{-1}\}$.

Proof: (6.6) and (6.7) still hold, and we plan to demonstrate (i) $r_n(x) \to_p 0$, (ii) $J_n^* \to_p J$,
and (iii) $U_n \to_d \mathcal{N}_p\{0, J\}$.

(i) is proved by using the tighter bound for $v_n(x, s)$ of (6.3) available by employing Lemma
A2, namely $\tfrac23 g(\max_{i \le n} |(z_i(s) - \bar z_n(s))'x|)\, x'V_n(s)x$, for $g(u) = u\exp(2u + 4u^2)$. This leads to

$$|r_n(x)| \le \int_0^L \tfrac23 g(\mu_n(s)|x|)\, x'V_n(s)x\, d\bar N_n(s)/n.$$

Split this into two terms, using $d\bar N_n(s) = R_n(s, \beta_0)\,d\Lambda(s) + \sum_{i \le n} dM_i(s)$. The first of the resulting
terms goes to zero in probability by assumptions on $J_n(s)$ and dominated convergence (appendix
A3), and the other term is of smaller stochastic order. Secondly (ii) follows as in the previous
proof, since the second term of (6.8) vanishes in probability, by variations of the same arguments.
Finally two ingredients are needed to secure (iii). The first is $\langle U_n, U_n\rangle(L) \to_p J$, which holds
by assumptions as in the previous proof. The second is a more elaborate demonstration of the
Lindeberg type condition (6.10), now accomplished by bounding it with

$$\int_0^L \mathrm{Tr}(J_n(s))\, I\{\mu_n(s) \ge \varepsilon\}\,d\Lambda(s),$$

which goes to zero in probability by dominated convergence (the integrand goes pointwise to zero
in probability and is dominated by $\mathrm{Tr}(J_n(s))$, see appendix A3 again).

And all this combined with the Basic Corollary triumphantly implies that the argmax of the
(6.6) function, which is $\sqrt{n}(\hat\beta_n - \beta_0)$, is only $o_p(1)$ away from the argmax of $U_n'x - \tfrac12 x'Jx$, which
is $J^{-1}U_n$. This proves consistency and asymptotic normality. []

Remarks. (i) Usually one would have $V_n(s) \to_p V(s)$ and $n^{-1}R_n(s, \beta_0) \to_p R(s, \beta_0)$, say, so
that $J_n(s) = V_n(s)R_n(s, \beta_0)/n \to_p J(s) = V(s)R(s, \beta_0)$; in particular $J = \int_0^L V(s)R(s, \beta_0)\,d\Lambda(s)$
in such cases, and this is the expression typically encountered for the inverse covariance matrix. (ii)
The Andersen and Gill regularity requirements include rather strong uniform convergence statements,
in both time $s$ and $\beta$ near $\beta_0$. In the development above this would mean requiring

$$\sup_{s \in [0, L]}\ \sup_{\beta \in U(\beta_0)} \Bigl| n^{-1}\sum_{i \le n} Y_i(s)z_i(s)z_i(s)'\exp(\beta'z_i(s)) - J(s, \beta) \Bigr| \to_p 0,$$

for example, for a suitable neighbourhood $U(\beta_0)$ and a suitable limit function $J(s, \beta)$. This contrasts
sharply with our condition (6.5), which is only about $\beta_0$, and is pointwise in $s$. Andersen and
Gill also include various other asymptotic stability conditions, about uniform continuity and differentiability
in $\beta$ of their limit functions, that are not needed here. Similarly, their conditions almost
require $\max_{s \le L} \mu_n(s) \to_p 0$ where we come away with pointwise convergence. (iii) It is interesting to
see that the key requirement (6.11) serves two different purposes: forcing an analytical remainder
term towards zero as well as securing uniform negligibility of individual terms, i.e. limiting normality.
(iv) The methods used here can be applied to solve the large-sample behaviour problem also
outside model conditions, say when the true hazard rate is $\lambda(s)\,r(z_{i,1}(s), \ldots, z_{i,p}(s))$ for individual
$i$. See Hjort (1992) for results. There are also various alternative estimation techniques that can be
employed in the Cox model, see for example Hjort (1991) for local likelihood smoothing and Hjort
(1992) for weighted log partial likelihood estimation. Again techniques from the present paper can
be applied. (v) Finally Jeffreys type arguments can be given in favour of using the vague prior
$\pi(\beta) = 1$, see Hjort (1986), where it is also shown that the (improper) pseudo-Bayes estimator
$\beta_n^* = \int \beta\exp(G_n(\beta))\,d\beta / \int \exp(G_n(\beta))\,d\beta$ is asymptotically equivalent to the Cox estimator $\hat\beta_n$.
The arguments of 4B can be used to provide a quicker and simpler proof of this.

7. Further applications. 

7A. Exponential hazard rate regression. The traditional Cox model (6.1) is semiparametric,
since the basis hazard rate $\lambda(\cdot)$ there is left unspecified. The parametric version

$$\lambda_i(s) = \lambda_0(s)\exp(\beta'z_i(s)) = \lambda_0(s)\exp(\beta_1z_{i,1}(s) + \cdots + \beta_pz_{i,p}(s)), \tag{7.1}$$

where $\lambda_0(\cdot)$ is fully specified (equal to 1, for example), is also important in survival data analysis. Let
us briefly show that the arguments above efficiently lead to a precise theorem about the maximum
likelihood estimator in this model as well.

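With notation and assumptions otherwise being as in section 6 the log-likelihood can be written,
up to an additive term not depending on $\beta$,

$$\log L_n(\beta) = \sum_{i \le n}\int_0^L \{\beta'z_i(s)\,dN_i(s) - Y_i(s)\exp(\beta'z_i(s))\,d\Lambda_0(s)\},$$

see for example Andersen et al. (1992, chapter VI), and let $\hat\beta_n$ be the ML estimator maximising
this expression. Assume that data follow (7.1) for a certain $\beta_0$. Using martingales $dM_i(s) =
dN_i(s) - Y_i(s)\exp(\beta_0'z_i(s))\,d\Lambda_0(s)$, writing $d\Lambda_0(s) = \lambda_0(s)\,ds$, we find

$$\begin{aligned}
G_n(x) &= \log L_n(\beta_0 + x) - \log L_n(\beta_0) \\
&= \sum_{i \le n}\int_0^L \bigl[x'z_i(s)\{dM_i(s) + Y_i(s)\exp(\beta_0'z_i(s))\,d\Lambda_0(s)\} \\
&\qquad\qquad - Y_i(s)\exp(\beta_0'z_i(s))\{\exp(x'z_i(s)) - 1\}\,d\Lambda_0(s)\bigr] \\
&= \Bigl\{\sum_{i \le n}\int_0^L z_i(s)\,dM_i(s)\Bigr\}'x - \sum_{i \le n}\int_0^L Y_i(s)\exp(\beta_0'z_i(s))\{\exp(x'z_i(s)) - 1 - x'z_i(s)\}\,d\Lambda_0(s).
\end{aligned}$$

Now use $|\exp(u) - 1 - u - \tfrac12 u^2| \le \tfrac16 |u|^3\exp(|u|)$, and introduce

$$J_n = \sum_{i \le n}\int_0^L Y_i(s)\exp(\beta_0'z_i(s))\,z_i(s)z_i(s)'\,d\Lambda_0(s) \quad \text{and} \quad U_n = J_n^{-1/2}\sum_{i \le n}\int_0^L z_i(s)\,dM_i(s) \tag{7.2}$$

to find $G_n(J_n^{-1/2}x) = U_n'x - \tfrac12 |x|^2 - r_n(x)$, with a remainder bound

$$|r_n(x)| \le \tfrac16 \sum_{i \le n}\int_0^L Y_i(s)\exp\bigl(\beta_0'z_i(s) + |x'J_n^{-1/2}z_i(s)|\bigr)\,|x'J_n^{-1/2}z_i(s)|^3\,d\Lambda_0(s). \tag{7.3}$$

To formulate a theorem with quite weak conditions, let $J_n(s) = \sum_{i \le n} Y_i(s)\exp(\beta_0'z_i(s))\,z_i(s)z_i(s)'$,
so that the 'observed information matrix' is $J_n = \int_0^L J_n(s)\,d\Lambda_0(s)$.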

Theorem 7.1. Let $\lambda_0(s)$ be positive and continuous on $[0, L]$. Suppose there is a $c_n$ sequence
converging to infinity such that $J_n(s)/c_n$ for almost all $s$ goes in probability to some $J(s)$, and that
$J_n/c_n \to_p J = \int_0^L J(s)\,d\Lambda_0(s)$, where this limit matrix is positive definite. Assume furthermore
that for almost all $s$,

$$N_n(s, \delta) = \sum_{i \le n} z_i(s)'J_n^{-1}z_i(s)\,Y_i(s)\exp(\beta_0'z_i(s))\, I\{|J_n^{-1/2}z_i(s)| \ge \delta\} \to_p 0 \tag{7.4}$$

for each $\delta > 0$, and that $\mu_n(s) = \max_{i \le n} |J_n^{-1/2}z_i(s)|$ is stochastically bounded, uniformly in $s$.
Then $J_n^{1/2}(\hat\beta_n - \beta_0) \to_d \mathcal{N}_p\{0, I_p\}$.

Some brief remarks are in order before turning to the proof. (i) $z_i(s)'J_n^{-1}z_i(s)$ can be replaced
by $z_i(s)'J^{-1}z_i(s)/c_n$ here, and $|J_n^{-1/2}z_i(s)|$ with $|J^{-1/2}z_i(s)|/c_n^{1/2}$. (ii) In many practical situations
the $c_n$ will be equal to $n$. (iii) The elements of $J_n$ may in some cases conceivably go to infinity with
different rates, and then the 'asymptotic stability' requirement should be the existence of matrices
$C_n$ going to infinity such that $C_n^{-1}J_n(s) \to J(s)$ etcetera. The theorem still holds. (iv) In many
cases one would have $\mu_n(s) \to_p 0$ for almost all $s$, and this implies condition (7.4), since in fact
$N_n(s, \delta) \le p\, I\{\mu_n(s) \ge \delta\}$. (v) If the $z_i(s)$ covariate processes are uniformly bounded, then (iv)
applies and hence the conclusion. (vi) Our conditions are much weaker than those used elsewhere
to secure large sample normality, see for example Borgan (1984, section 6). (vii) Finally we note
that the proof below becomes easier under circumstances (iv) or (v).

Proof: The log-likelihood is concave by Lemma A2 and hence so is the $G_n(J_n^{-1/2}x)$ function.
We are to prove (i) $r_n(x) \to 0$ in probability for each $x$, and (ii) that $U_n \to \mathcal{N}_p\{0, I_p\}$ in distribution.

To prove (i) let $r_n(x, s)$ be the integrand in the bound occurring in (7.3), so that $|r_n(x)| \le
\tfrac16 \int_0^L r_n(x, s)\,d\Lambda_0(s)$. It will suffice to show that $r_n(x, s) \to 0$ in probability for almost all $s$ and
to bound it properly. Splitting into $|J_n^{-1/2}z_i(s)| \le \delta$ terms and $|J_n^{-1/2}z_i(s)| > \delta$ terms we find
$r_n(x, s) \le |x|^3\delta\exp(|x|\delta) + |x|^3\mu_n(s)\exp(|x|\mu_n(s))\,N_n(s, \delta)$, after which the claim follows by our
precautions and by the dominated convergence lemma of the appendix.

Next (ii) can be replaced by $U_n^* \to_d \mathcal{N}_p\{0, I_p\}$, where $U_n^* = \sum_{i \le n}\int_0^L c_n^{-1/2}J^{-1/2}z_i(s)\,dM_i(s)$,
and we show this employing the Rebolledo theorem version given in Andersen and Gill (1982,
appendix I). Its variance process converges properly,

$$\langle U_n^*, U_n^*\rangle(L) = \sum_{i \le n}\int_0^L c_n^{-1}J^{-1/2}z_i(s)z_i(s)'J^{-1/2}\,Y_i(s)\exp(\beta_0'z_i(s))\,d\Lambda_0(s) = J^{-1/2}(J_n/c_n)J^{-1/2} \to_p I_p,$$

and the Lindebergian condition is also satisfied:

$$\sum_{i \le n}\int_0^L |c_n^{-1/2}J^{-1/2}z_i(s)|^2\, I\{c_n^{-1/2}|J^{-1/2}z_i(s)| \ge \varepsilon\}\, Y_i(s)\exp(\beta_0'z_i(s))\,d\Lambda_0(s) \to_p 0.$$
Asymptotics for minimisers 18 May 1993 



This holds since the integrand is asymptotically the same as $N_n(s, \delta)$ of (7.4), and is bounded by
the constant $p$, so that the dominated convergence lemma applies. []
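
To see the estimator of this subsection at work, consider the simplest special case: $\lambda_0 = 1$, a scalar time-constant covariate, and censoring only at the end of the observation window $[0, L]$. The log-likelihood then reduces to $\sum_i \{\delta_i\beta z_i - T_i\exp(\beta z_i)\}$. The sketch below simulates such data with hypothetical parameter values and maximises the log-likelihood by Newton-Raphson; the optimiser is a generic choice, not something the paper prescribes.

```python
import math
import random

def fit_exponential_regression(times, events, z, iters: int = 30) -> float:
    """Newton-Raphson for sum(d_i*b*z_i - T_i*exp(b*z_i)), i.e. (7.1) with lambda_0 = 1."""
    b = 0.0
    for _ in range(iters):
        score = sum(d * zi - t * zi * math.exp(b * zi) for t, d, zi in zip(times, events, z))
        info = sum(t * zi * zi * math.exp(b * zi) for t, zi in zip(times, z))
        b += score / info
    return b

rng = random.Random(7)
beta0, L, n = 0.8, 3.0, 2000                                   # hypothetical settings
z = [rng.choice((-1.0, 0.0, 1.0)) for _ in range(n)]
latent = [rng.expovariate(math.exp(beta0 * zi)) for zi in z]   # uncensored lifetimes
times = [min(t, L) for t in latent]                            # observation stops at L
events = [1 if t <= L else 0 for t in latent]

beta_hat = fit_exponential_regression(times, events, z)
J_n = sum(t * zi * zi * math.exp(beta0 * zi) for t, zi in zip(times, z))
print(beta_hat, math.sqrt(J_n) * (beta_hat - beta0))
```

The quantity `J_n` here is the observed information of (7.2) evaluated in this special case, so the second printed number is the standardised error appearing in Theorem 7.1.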

7B. Poisson regression. Suppose $Y_1, \ldots, Y_n$ are independent counts with

$$Y_i \sim \text{Poisson}(\text{mean}_i), \quad \text{with } \text{mean}_i = \exp(\beta'z_i), \tag{7.5}$$

depending on a certain $p$-dimensional covariate vector $z_i$, for a certain true parameter value $\beta_0$. It
is convenient now to write $\exp(u) = 1 + u + \tfrac12 u^2 + \tfrac16 \rho(u)$, where a bound for the remainder function
is $|\rho(u)| \le |u|^3\exp(|u|)$. The log-likelihood $\log L_n(\beta) = \sum_{i \le n}\{Y_i\beta'z_i - \exp(\beta'z_i)\}$ is concave, and
after development very similar to that of section 7A one finds

$$\log L_n(\beta_0 + x) - \log L_n(\beta_0) = \Bigl\{\sum_{i \le n}(Y_i - \mu_i)z_i\Bigr\}'x - \tfrac12 x'J_nx - \tfrac16 v_n(x),$$

in which $J_n = \sum_{i \le n} \mu_iz_iz_i'$ and $v_n(x) = \sum_{i \le n} \mu_i\rho(x'z_i)$. In these expressions $\mu_i = \exp(\beta_0'z_i)$ is the
true mean for $Y_i$ under the model.
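
The Poisson expansion is an exact identity, which makes it easy to verify numerically. The sketch below assumes scalar covariates and hypothetical counts; `rho` is defined so that $\exp(u) = 1 + u + \tfrac12 u^2 + \tfrac16\rho(u)$ holds exactly.

```python
import math

def rho(u: float) -> float:
    """Remainder in exp(u) = 1 + u + u^2/2 + rho(u)/6."""
    return 6.0 * (math.exp(u) - 1.0 - u - 0.5 * u * u)

def log_lik(beta, ys, zs):
    """Poisson log-likelihood, dropping the terms that do not involve beta."""
    return sum(y * beta * z - math.exp(beta * z) for y, z in zip(ys, zs))

ys = [0, 2, 1, 4, 0, 3]                    # hypothetical counts
zs = [-1.0, 0.5, 0.0, 1.5, -0.5, 1.0]      # hypothetical scalar covariates
beta0, x = 0.6, 0.3

mus = [math.exp(beta0 * z) for z in zs]
lhs = log_lik(beta0 + x, ys, zs) - log_lik(beta0, ys, zs)
linear = sum((y - m) * z for y, m, z in zip(ys, mus, zs)) * x
J_n = sum(m * z * z for m, z in zip(mus, zs))
v_n = sum(m * rho(x * z) for m, z in zip(mus, zs))
rhs = linear - 0.5 * J_n * x * x - v_n / 6.0
print(lhs, rhs)   # the two sides of the expansion should coincide up to rounding
```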

Before passing to a theorem we solve a relevant exercise in asymptotics of linear combinations
of independent Poisson variables. If $Y_{n,i}$ is Poisson with mean $\mu_{n,i}$, then $\sum_{i \le n}(Y_{n,i} - \mu_{n,i})x_{n,i}$,
normed such that its variance $\sum_{i \le n} \mu_{n,i}x_{n,i}^2 = 1$, goes to a standard normal if and only if
$\sum_{i \le n} \mu_{n,i}\rho(tx_{n,i}) \to 0$ for each $t$, which is equivalent to $\sum_{i \le n} \mu_{n,i}\rho(|x_{n,i}|) \to 0$. This is seen
after considering moment or cumulant generating functions.

Theorem 7.2. Let $\hat\beta_n$ be the ML estimator based on the first $n$ Poisson counts, assumed
to follow (7.5) for a certain $\beta_0$, with means $\mu_i = \exp(\beta_0'z_i)$. Then $J_n^{1/2}(\hat\beta_n - \beta_0) \to_d \mathcal{N}_p\{0, I_p\}$
if and only if $\sum_{i \le n} \mu_i\rho(|J_n^{-1/2}z_i|) \to 0$. A simple sufficient condition for this to hold is that
$\lambda_n = \max_{i \le n} |J_n^{-1/2}z_i|$ is bounded and that $\sum_{i \le n} \mu_i|J_n^{-1/2}z_i|^3 \to 0$; or, equivalently, that $\lambda_n$ is
bounded and that

$$N_n(\delta) = \sum_{i \le n} \mu_i\, z_i'J_n^{-1}z_i\, I\{|J_n^{-1/2}z_i| \ge \delta\} \to 0 \quad \text{for each } \delta.$$

Proof: The function $\log L_n(\beta_0 + J_n^{-1/2}x) - \log L_n(\beta_0)$ is concave in $x$ and can be written
$U_n'x - \tfrac12 |x|^2 - r_n(x)$, where $U_n = J_n^{-1/2}\sum_{i \le n}(Y_i - \mu_i)z_i$ and where $r_n(x) = \tfrac16 \sum_{i \le n} \mu_i\rho(x'J_n^{-1/2}z_i)$.
This is quite similar to the situation in 7A, and the maximiser $J_n^{1/2}(\hat\beta_n - \beta_0)$ goes to a standard
$p$-dimensional normal if and only if (i) $r_n(x) \to 0$ and (ii) $U_n \to_d \mathcal{N}_p\{0, I_p\}$. But using the result
above in tandem with the Cramér-Wold theorem one sees that $\sum_{i \le n} \mu_i\rho(|J_n^{-1/2}z_i|) \to 0$ is necessary
and sufficient for (ii), and indeed also necessary and sufficient for (i). The other statements of the
theorem follow from $|\rho(u)| \le |u|^3\exp(|u|)$. []

We note that $\lambda_n \to 0$ is clearly sufficient for the result to hold.
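
The necessary and sufficient condition of Theorem 7.2 can be evaluated directly for a given design. The sketch below assumes a scalar covariate cycling through three bounded values and a hypothetical $\beta_0$; in that situation $\lambda_n \to 0$, so the criterion should shrink as $n$ grows.

```python
import math

def rho(u: float) -> float:
    """Remainder in exp(u) = 1 + u + u^2/2 + rho(u)/6."""
    return 6.0 * (math.exp(u) - 1.0 - u - 0.5 * u * u)

def poisson_condition(n: int, beta0: float = 0.6) -> float:
    """sum_i mu_i * rho(|z_i| / sqrt(J_n)) for a bounded, cycling scalar design."""
    zs = [(-1.0, 0.5, 1.0)[i % 3] for i in range(n)]    # hypothetical covariates
    mus = [math.exp(beta0 * z) for z in zs]
    J_n = sum(m * z * z for m, z in zip(mus, zs))
    return sum(m * rho(abs(z) / math.sqrt(J_n)) for m, z in zip(mus, zs))

values = {n: poisson_condition(n) for n in (10, 100, 1000, 10000)}
for n, value in values.items():
    print(n, value)
```

With bounded covariates the criterion behaves like a constant times $n^{-1/2}$, in line with the sufficient condition $\sum_{i \le n} \mu_i|J_n^{-1/2}z_i|^3 \to 0$.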

7C. Generalised linear models. Consider a situation with independent $Y_i$'s from densities of
the form $f(y_i \mid \theta_i) = \exp\{(y_i\theta_i - b(\theta_i))/a(\varphi) + c(y_i, \varphi)\}$, and where $\theta_i$ is parametrised as a linear
$x_i'\beta$. This is a generalised linear model with canonical link, see McCullagh and Nelder (1989). The
likelihood in $\beta$ is log-concave, and theorems about the large-sample behaviour of the ML estimator,
under very weak regularity conditions, can be written down and proved by the methods exemplified
in sections 5 and 7A.

7D. Pseudo-likelihood estimation in Markov chains. Suppose $X_0, X_1, \ldots$ forms a Markov chain
on the state space $\{1, \ldots, k\}$. Instead of focusing on transition probabilities, consider direct modelling
of $X_i$ given its neighbours, say $X_{\partial i} = x_{\partial i}$. A very flexible and convenient class of models is
described by

$$f_\beta(x_i \mid x_{\partial i}) = \text{const.}\exp\{\alpha_i(x_i) + \beta'H_i(x_i, x_{\partial i})\} = \frac{\exp\{\alpha_i(x_i) + \beta'H_i(x_i, x_{\partial i})\}}{\sum_{j=1}^k \exp\{\alpha_i(j) + \beta'H_i(j, x_{\partial i})\}}, \tag{7.6}$$

where $\alpha_i(1), \ldots, \alpha_i(k)$ are specified or unknown parameters, and $\beta'H_i(x_i, x_{\partial i}) = \sum_{u=1}^p \beta_uH_{i,u}(x_i,
x_{\partial i})$ for certain component functions $H_{i,u}$ that depend both on the $x_i$ at time position $i$ and on
its neighbouring values $x_{\partial i}$. For a second order Markov chain, for example, one would typically
have $H_i$ equal to a common $H$ function for $2 \le i \le n - 2$ and some special functions at the
borders. Maximum pseudo-likelihood estimation maximises $\mathrm{PL}_n(\beta) = \prod_{i=0}^n f_\beta(x_i \mid x_{\partial i})$ w.r.t. the
parameters. See Hjort and Omre (1993, section 3.2), for example, for comments on this model
building machinery in dimensions 1 and 2, and for some comments on the difference between
maximum PL and maximum likelihood.
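
The conditional probabilities (7.6) are a softmax over the $k$ states, and the log pseudo-likelihood is a sum of their logarithms. The sketch below makes several simplifying assumptions for illustration only: a scalar $\beta$, nearest neighbours $x_{i-1}$ and $x_{i+1}$, interior positions only, and a hypothetical component function $H$ that counts neighbours equal to the candidate state.

```python
import math

def conditional_probs(beta, alpha, neighbours, k):
    """f_beta(j | x_neighbours) of (7.6) for j = 1..k, with an illustrative H."""
    def H(j):
        # Hypothetical choice: number of neighbours sharing the state j.
        return sum(1 for x in neighbours if x == j)
    scores = [alpha[j - 1] + beta * H(j) for j in range(1, k + 1)]
    m = max(scores)                        # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def log_pseudo_likelihood(beta, alpha, chain, k):
    """Sum of log f_beta(x_i | x_{i-1}, x_{i+1}) over interior positions only."""
    out = 0.0
    for i in range(1, len(chain) - 1):
        probs = conditional_probs(beta, alpha, (chain[i - 1], chain[i + 1]), k)
        out += math.log(probs[chain[i] - 1])
    return out

chain = [1, 1, 2, 2, 2, 3, 1, 1, 1, 3, 3, 2]    # hypothetical observed states
alpha = [0.0, 0.0, 0.0]
probs = conditional_probs(1.0, alpha, (2, 2), 3)
mid = log_pseudo_likelihood(0.5, alpha, chain, 3)
avg = 0.5 * (log_pseudo_likelihood(-1.0, alpha, chain, 3)
             + log_pseudo_likelihood(2.0, alpha, chain, 3))
print(probs, mid, avg)   # concavity in beta: mid should be at least avg
```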

We may incorporate the $\alpha_i(x)$'s in the vector $H_i(x, x_{\partial i})$ of $H_{i,u}$-functions, for notational convenience.
From Lemma A2 $\log \mathrm{PL}_n(\beta)$ is concave in $\beta$. Consider $h_i(x_{\partial i}) = \sum_{j=1}^k H_i(j, x_{\partial i})\,f_{\beta_0}(j \mid x_{\partial i})$
and

$$V_i(x_{\partial i}) = \sum_{j=1}^k (H_i(j, x_{\partial i}) - h_i(x_{\partial i}))(H_i(j, x_{\partial i}) - h_i(x_{\partial i}))'\,f_{\beta_0}(j \mid x_{\partial i}),$$

which can be interpreted as respectively $E\{H_i(X_i, X_{\partial i}) \mid x_{\partial i}\}$ and $\mathrm{VAR}\{H_i(X_i, X_{\partial i}) \mid x_{\partial i}\}$. After some
work exploiting Lemma A2 one finds that

$$\log \mathrm{PL}_n(\beta_0 + s/\sqrt{n}) - \log \mathrm{PL}_n(\beta_0) = U_n's - \tfrac12 s'(J_n/n)s - r_n(s), \tag{7.7}$$

where

$$U_n = n^{-1/2}\sum_{i=0}^n \{H_i(X_i, X_{\partial i}) - h_i(X_{\partial i})\} \quad \text{and} \quad J_n = \sum_{i=0}^n V_i(X_{\partial i}),$$

and where in fact $r_n(s) = O(n^{-1/2})$. The usual arguments now give $\sqrt{n}(\hat\beta_n - \beta_0) \to_d \mathcal{N}_p\{0, J^{-1}\}$
under mild assumptions, provided the assumed model (7.6) is correct. Here $J$ turns out to be both
the limit of $J_n/n$ as well as the covariance matrix in the limiting distribution for $U_n$. There is also
an appropriate sandwich generalisation with covariance matrix of type $J^{-1}KJ^{-1}$ outside model
conditions. Doing the details here properly calls for a central limit theorem and a weak law of large
numbers for Markov chains, and such can be found in Billingsley (1961), for example.

These Markov random field models are more important in the 2- and 3-dimensional cases, 
where one enters the world of statistical image analysis. The method above can be used to prove 
consistency of the maximum PL estimator. 

Appendix. Here we give three lemmas that were used at various stages above. They should 
also have some independent interest. 

A1. Necessary and sufficient conditions for asymptotic normality of linear combinations of
binomials. The following result with further consequences was used in section 5.

Lemma A1. Consider independent Bernoulli variables $Y_{n,i} \sim \mathrm{Bin}\{1, q_{n,i}\}$, and real numbers
$z_{n,i}$ standardised to have $\sum_{i \le n} z_{n,i}^2q_{n,i}(1 - q_{n,i}) = 1$. Then $\sum_{i \le n} z_{n,i}(Y_{n,i} - q_{n,i}) \to_d \mathcal{N}\{0, 1\}$ if
and only if

$$N_n(\delta) = \sum_{i \le n} z_{n,i}^2q_{n,i}(1 - q_{n,i})\, I\{|z_{n,i}| \ge \delta\} \to 0 \quad \text{for each positive } \delta. \tag{A.1}$$

Proof: The Lindeberg condition is that

$$\begin{aligned}
L_n(\delta) &= \sum_{i \le n} E\, z_{n,i}^2(Y_{n,i} - q_{n,i})^2\, I\{|z_{n,i}(Y_{n,i} - q_{n,i})| \ge \delta\} \\
&= \sum_{i \le n} z_{n,i}^2q_{n,i}(1 - q_{n,i})\bigl[q_{n,i}\, I\{|q_{n,i}z_{n,i}| \ge \delta\} + (1 - q_{n,i})\, I\{|(1 - q_{n,i})z_{n,i}| \ge \delta\}\bigr]
\end{aligned}$$

should tend to zero for each positive $\delta$. It is not difficult to establish $\tfrac12 N_n(2\delta) \le L_n(\delta) \le N_n(\delta)$,
so (A.1) is in fact equivalent to the Lindeberg requirement. In particular (A.1) implies a $\mathcal{N}\{0, 1\}$
limit.
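
The two-sided inequality between $N_n$ and the Lindeberg sum $L_n$ can be checked on any standardised array. In the sketch below the raw weights and success probabilities are hypothetical; the weights are rescaled so that $\sum_{i \le n} z_{n,i}^2q_{n,i}(1 - q_{n,i}) = 1$, as the lemma requires.

```python
def N_n(zs, qs, delta):
    """N_n(delta) of (A.1)."""
    return sum(z * z * q * (1 - q) for z, q in zip(zs, qs) if abs(z) >= delta)

def L_n(zs, qs, delta):
    """Lindeberg sum for the centred, weighted Bernoulli variables."""
    total = 0.0
    for z, q in zip(zs, qs):
        # Y - q equals 1 - q with probability q and -q with probability 1 - q.
        inner = q * (abs(q * z) >= delta) + (1 - q) * (abs((1 - q) * z) >= delta)
        total += z * z * q * (1 - q) * inner
    return total

raw = [3.0, -1.0, 0.5, 2.0, -0.2, 1.0, 0.1, -4.0]       # hypothetical unscaled weights
qs = [0.1, 0.5, 0.9, 0.3, 0.7, 0.2, 0.6, 0.05]          # hypothetical probabilities
scale = sum(r * r * q * (1 - q) for r, q in zip(raw, qs)) ** 0.5
zs = [r / scale for r in raw]                            # now sum z^2 q (1 - q) = 1

checks = [(N_n(zs, qs, 2 * d) / 2, L_n(zs, qs, d), N_n(zs, qs, d))
          for d in (0.1, 0.3, 0.6, 1.0)]
for lower, middle, upper in checks:
    print(lower, middle, upper)   # expect lower <= middle <= upper on each row
```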

Necessity is harder. Assume a $\mathcal{N}\{0, 1\}$ limit in distribution. We first symmetrise in the
following fashion: Let $\tilde Y_{n,i} = Y_{n,i} - Y_{n,i}'$, where $Y_{n,1}', Y_{n,2}', \ldots$ are independent copies of $Y_{n,1}, Y_{n,2}, \ldots$,
and let

$$Z_n = \sum_{i \le n} z_{n,i}(Y_{n,i} - q_{n,i}), \quad Z_n' = \sum_{i \le n} z_{n,i}(Y_{n,i}' - q_{n,i}), \quad \text{and} \quad \tilde Z_n = Z_n - Z_n'.$$

By assumption $\tilde Z_n \to_d \mathcal{N}\{0, 2\}$. We first show that

$$m_n = \max_{i \le n} \min\{|z_{n,i}|, q_{n,i}, 1 - q_{n,i}\} \to 0.$$

Otherwise there would be some $\varepsilon > 0$ such that say $|z_{n,1}| \ge \varepsilon$ and $\varepsilon \le q_{n,1} \le 1 - \varepsilon$. Break $\tilde Z_n$ into
a sum of $V_n = z_{n,1}\tilde Y_{n,1}$ and $W_n$, two independent and symmetric variables. Uniform tightness of
$\tilde Z_n$ and symmetry imply uniform tightness of both $V_n$ and $W_n$. Along some subsequence we would
have $V_n \to_d V$ and $W_n \to_d W$, independent with $V + W$ distributed as $\mathcal{N}\{0, 2\}$. By Cramér's
theorem about convolution factors of $\mathcal{N}\{0, 2\}$ we would have $V$ normal. But $V$ is not degenerate,
and cannot be normal after all, since $V_n$ takes only three values. This proves $m_n \to 0$.
But this implies the usual infinitesimal array property

$$\max_{i \le n} \Pr\{|z_{n,i}\tilde Y_{n,i}| \ge \delta\} \to 0 \quad \text{for each } \delta.$$

For if $|z_{n,i}| < \delta$ then the probability is zero, and if $|z_{n,i}| \ge \delta$ then the probability is $2q_{n,i}(1 - q_{n,i}) \le
2m_n$ when $n$ is large enough for $m_n < \delta$ to hold. Next look at page 92 of Petrov (1975). From
limiting normality follows

$$\sum_{i \le n} \mathrm{Var}\bigl[z_{n,i}\tilde Y_{n,i}\, I\{|z_{n,i}\tilde Y_{n,i}| < \delta\}\bigr] \to 2.$$

If $|z_{n,i}| \ge \delta$ the indicator here picks out $\tilde Y_{n,i} = 0$, and there is no contribution to the sum, whereas
if $|z_{n,i}| < \delta$ the summand is $2z_{n,i}^2q_{n,i}(1 - q_{n,i})$. Hence $\sum_{i \le n} z_{n,i}^2q_{n,i}(1 - q_{n,i})\, I\{|z_{n,i}| < \delta\} \to 1$, and
$N_n(\delta) \to 0$ follows from the assumed $\sum_{i \le n} z_{n,i}^2q_{n,i}(1 - q_{n,i}) = 1$. []

The surprising thing here is that we do not need to explicitly assume $\max_{i \le n} E\{z_{n,i}(Y_{n,i} -
q_{n,i})\}^2 \to 0$, as with Feller's partial converse to the Lindeberg theorem; it follows from asymptotic
normality and the special properties of the $Y_{n,i}$ sequence.

Lemma A1 can next be used to address the vector case, via the Cramér-Wold theorem. We
phrase the result as follows, to suit the development of section 5. If $x_1, x_2, \ldots$ is a sequence of
$p$-vectors, and $Y_1, Y_2, \ldots$ are Bernoulli with $q_1, q_2, \ldots$, then

$$J_n^{-1/2}\sum_{i \le n}(Y_i - q_i)x_i \to_d \mathcal{N}_p\{0, I_p\}, \quad \text{where } J_n = \sum_{i \le n} q_i(1 - q_i)x_ix_i', \tag{A.2}$$

if and only if

$$N_n(\delta) = \sum_{i \le n} x_i'J_n^{-1}x_i\, q_i(1 - q_i)\, I\{|J_n^{-1/2}x_i| \ge \delta\} \to 0 \quad \text{for each } \delta > 0. \tag{A.3}$$

This is proved by noting first that (A.2) is equivalent to

$$N_n^0(s, \delta) = \sum_{i \le n} (s'J_n^{-1/2}x_i)^2\, q_i(1 - q_i)\, I\{|s'J_n^{-1/2}x_i| \ge \delta\} \to 0$$

for all $s$ with length 1 and all positive $\delta$. But $N_n^0(s, \delta) \le N_n(\delta) \le p^2\max_{j \le p} N_n^0(e_j, \delta/\sqrt{p})$,
where $e_j$ is the $j$th unit vector.

A simple sufficient condition for (A.1) to hold is that $\lambda_n^0 = \max_{i \le n} |z_{n,i}| \to 0$, since the left
hand side of (A.1) is bounded by $\lambda_n^0/\delta$. Similarly condition (A.3) is implied by the simpler condition
$\lambda_n = \max_{i \le n} |J_n^{-1/2}x_i| \to 0$, since $N_n(\delta) \le \lambda_np/\delta$.

A2. Expansion lemma. The following result was used in section 6 and in 7A. 

Lemma A2. (i) Suppose K{t) = logi?(t), where R{t) = X]i<ra ^i 6^P('^i*) ^^^ certain nonneg- 
ative weights Wi, not all equal to zero, and arbitrary constants a^. Let Vi{t) = Wi exp{ait) / R{t) be 
the tilted and normalised weights, summing to one. Then K{t) is convex with derivatives 

K'{t) = Y,Vi{t)ai = a{t), 

i<.n 

K"{t) = Y,v^ma^-a{t)}^ 

i<.n 
K"'{t) = Y,V^{t)W^-a{t)f. 



i<n 



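(ii) The expansion

$$\log\Bigl\{\sum_{i\le n} w_i e^{a_i t}\Bigr\} - \log\Bigl\{\sum_{i\le n} w_i\Bigr\} = \bar a(0) t + \tfrac12 \sum_{i\le n} v_i(0)\{a_i - \bar a(0)\}^2 t^2 + v(t)$$

holds, featuring untilted weights $v_i(0) = w_i/\sum_{j\le n} w_j$, with the following valid bounds on the
remainder:

$$|v(t)| \le \tfrac43 \mu_n^3 |t|^3, \qquad |v(t)| \le \tfrac23 g(\mu_n|t|) \sum_{i\le n} v_i(0)\{a_i - \bar a(0)\}^2 t^2. \tag{A.4}$$

Here $\mu_n = \max_{i\le n} |a_i - \bar a(0)|$ and $g$ is the function $g(u) = u \exp(2u + 4u^2)$.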
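Both parts of Lemma A2 lend themselves to a quick numerical sanity check. The following sketch evaluates $K$, $\bar a$, $K''$ and $K'''$ from the tilted weights, and compares the remainder $v(t)$ with the two bounds in (A.4). The weights and constants used in the demonstration are arbitrary illustrative values, and the function names are ours, not the paper's.

```python
import math

def tilted_moments(w, a, t):
    """Return K(t), abar(t), K''(t), K'''(t) for K(t) = log sum_i w_i exp(a_i t)."""
    r = sum(wi * math.exp(ai * t) for wi, ai in zip(w, a))
    v = [wi * math.exp(ai * t) / r for wi, ai in zip(w, a)]  # tilted weights
    abar = sum(vi * ai for vi, ai in zip(v, a))
    k2 = sum(vi * (ai - abar) ** 2 for vi, ai in zip(v, a))
    k3 = sum(vi * (ai - abar) ** 3 for vi, ai in zip(v, a))
    return math.log(r), abar, k2, k3

def g(u):
    """The function g(u) = u exp(2u + 4u^2) from Lemma A2."""
    return u * math.exp(2 * u + 4 * u * u)

def remainder_and_bounds(w, a, t):
    """Return (|v(t)|, first bound, second bound) in the notation of (A.4)."""
    k0, abar0, k2_0, _ = tilted_moments(w, a, 0.0)
    kt = tilted_moments(w, a, t)[0]
    remainder = kt - k0 - abar0 * t - 0.5 * k2_0 * t * t
    mu = max(abs(ai - abar0) for ai in a)
    bound1 = (4.0 / 3.0) * mu ** 3 * abs(t) ** 3
    bound2 = (2.0 / 3.0) * g(mu * abs(t)) * k2_0 * t * t
    return abs(remainder), bound1, bound2

if __name__ == "__main__":
    # Arbitrary illustrative weights and constants (assumptions, not from the paper).
    w = [1.0, 2.0, 0.5, 3.0]
    a = [-1.0, 0.3, 2.0, 0.7]
    for t in (-0.5, 0.1, 1.0):
        print(t, remainder_and_bounds(w, a, t))
```

On such examples the cubic bound tends to be the smaller one unless $\sum_{i\le n} v_i(0)\{a_i - \bar a(0)\}^2$ is small relative to $\mu_n^2$, which is the situation the second bound is designed for.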

Proof: The formulae for the derivatives are proved by direct differentiation and inspection,
and convexity follows of course from the nonnegative second derivative. To prove (ii), consider the
exact third order Taylor expansion $K(t) - K(0) = K'(0)t + \tfrac12 K''(0)t^2 + \tfrac16 K'''(s)t^3$ for some suitable
$s$ between 0 and $t$. The problem is to bound the remainder term in terms of $\mu_n$.

The first bound is easy. It follows upon observing that $|\bar a(s) - \bar a(0)| = |\sum_{i\le n} v_i(s)\{a_i - \bar a(0)\}| \le \mu_n$
and its triangle inequality consequence $|a_i - \bar a(s)| \le 2\mu_n$, since this yields $|K'''(s)| \le (2\mu_n)^3$.
While this bound often suffices we shall have occasion to need the sharper second bound too. The
point is to exploit the fact that $|s|$ is bounded by $|t|$ when bounding $|K'''(s)|$. Start out writing
$v_i(s) = v_i(0)(1 + \varepsilon_i)$, where some analysis shows that $\exp(-2\mu_n|t|) \le 1 + \varepsilon_i \le \exp(2\mu_n|t|)$. Then

$$|\bar a(s) - \bar a(0)| = \Bigl|\sum_{i\le n} v_i(0)(1 + \varepsilon_i)\{a_i - \bar a(0)\}\Bigr| = \Bigl|\sum_{i\le n} v_i(0)\varepsilon_i\{a_i - \bar a(0)\}\Bigr| \le K''(0)^{1/2}\Bigl\{\sum_{i\le n} v_i(0)\varepsilon_i^2\Bigr\}^{1/2}.$$

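A further bound on the right hand side is $K''(0)^{1/2}\delta_n = K''(0)^{1/2} \max_{i\le n} |\varepsilon_i|$. This gives

$$|K'''(s)| \le 2\mu_n \sum_{i\le n} v_i(s)|a_i - \bar a(s)|^2 \le 4\mu_n \sum_{i\le n} v_i(0)(1 + \varepsilon_i)\bigl[\{a_i - \bar a(0)\}^2 + \{\bar a(0) - \bar a(s)\}^2\bigr] \le 4\mu_n(1 + \delta_n)(1 + \delta_n^2)K''(0).$$

But some checking reveals $1 + \delta_n \le \exp(2\mu_n|t|)$ and $1 + \delta_n^2 \le \exp(4\mu_n^2|t|^2)$. This shows $|K'''(s)| \le
4|t|^{-1} g(\mu_n|t|) K''(0)$ with the $g$-function given above. □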
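The intermediate inequalities of the proof can also be probed numerically. The sketch below is a check on a grid, not a proof: for every grid point $s$ with $|s| \le |t|$ it tests the two-sided bound on $1 + \varepsilon_i = v_i(s)/v_i(0)$, the Cauchy-Schwarz consequence $|\bar a(s) - \bar a(0)| \le K''(0)^{1/2}\delta_n$, and the final bound on $|K'''(s)|$, written as $4\mu_n \exp(2u + 4u^2)K''(0)$ with $u = \mu_n|t|$ to avoid dividing by $|t|$. It assumes all weights are strictly positive so that the ratios are defined; the function names and the small floating-point slack are our own choices.

```python
import math

def weights(w, a, s):
    """Tilted, normalised weights v_i(s) = w_i exp(a_i s) / R(s)."""
    r = sum(wi * math.exp(ai * s) for wi, ai in zip(w, a))
    return [wi * math.exp(ai * s) / r for wi, ai in zip(w, a)]

def check_proof_steps(w, a, t, grid=50):
    """Return True if the proof's inequalities hold at all grid points |s| <= |t|.

    Assumes every w_i > 0, so that eps_i = v_i(s)/v_i(0) - 1 is well defined.
    """
    v0 = weights(w, a, 0.0)
    abar0 = sum(v * ai for v, ai in zip(v0, a))
    k2_0 = sum(v * (ai - abar0) ** 2 for v, ai in zip(v0, a))
    mu = max(abs(ai - abar0) for ai in a)
    u = mu * abs(t)
    slack = 1e-12  # tolerance for floating-point rounding
    for k in range(-grid, grid + 1):
        s = t * k / grid
        vs = weights(w, a, s)
        eps = [v_s / v_0 - 1.0 for v_s, v_0 in zip(vs, v0)]
        # exp(-2 mu |t|) <= 1 + eps_i <= exp(2 mu |t|)
        lo, hi = math.exp(-2 * u) - slack, math.exp(2 * u) + slack
        if not all(lo <= 1 + e <= hi for e in eps):
            return False
        abar_s = sum(v * ai for v, ai in zip(vs, a))
        delta = max(abs(e) for e in eps)
        # |abar(s) - abar(0)| <= K''(0)^{1/2} * delta_n
        if abs(abar_s - abar0) > math.sqrt(k2_0) * delta + slack:
            return False
        k3_s = sum(v * (ai - abar_s) ** 3 for v, ai in zip(vs, a))
        # |K'''(s)| <= 4 |t|^{-1} g(mu |t|) K''(0) = 4 mu exp(2u + 4u^2) K''(0)
        if abs(k3_s) > 4 * mu * math.exp(2 * u + 4 * u * u) * k2_0 + slack:
            return False
    return True

if __name__ == "__main__":
    # Arbitrary illustrative inputs (assumptions, not from the paper).
    print(check_proof_steps([1.0, 2.0, 0.5, 3.0], [-1.0, 0.3, 2.0, 0.7], 1.0))
```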

A3. Dominated convergence theorem for convergence in probability. This result was used 
several times in section 6, and a close relative was used in 4B. 

Lemma A3. Let $0 \le X_n(s,\omega) \le Y_n(s,\omega)$ be jointly $(s,\omega)$-measurable random functions on
the interval $[0, L]$. Suppose $\lambda$ is a measure such that $Y_n(s) \to_p Y(s)$ and $X_n(s) \to_p X(s)$ for
$\lambda$ almost all $s$ and that $\int Y_n(s)\,d\lambda(s) \to_p \int Y(s)\,d\lambda(s)$, a limit finite almost everywhere. Then
$\int X_n(s)\,d\lambda(s) \to_p \int X(s)\,d\lambda(s)$ too.

Proof: It is enough to check almost sure convergence for a subsubsequence of each subsequence.
By convergence in probability (for $\omega$, with $s$ fixed) and then dominated convergence, we
have

$$\pi_n(\varepsilon) := (\mathbb{P} \otimes \lambda)\{(\omega, s): |X_n(\omega, s) - X(\omega, s)| > \varepsilon\} \to 0.$$

A similar result holds for $\{Y_n\}$. Replace $\varepsilon$ by a sequence $\{\delta_n\}$ decreasing to zero, then extract a
subsubsequence along which the sequence of integrals is convergent. For some set $N$ with $(\mathbb{P} \otimes
\lambda)N = 0$ we get convergence for all $(\omega, s) \in N^c$ of both the $X_n$ and the $Y_n$ subsubsequences. For
almost all $\omega$, therefore, $\lambda\{s: (\omega, s) \in N\} = 0$. Finally argue using the Fatou Lemma for $Y_n \pm X_n$
along the subsubsequences to get the result. □

This is nice in that it circumvents the need to establish uniformity of the convergence in
probability; this is typically more difficult to ascertain than pointwise convergence in probability.
The lemma was used several times in sections 6 and 7A, partly in the form of the following useful
corollary: If in particular $Z_n(s) \to_p 0$ for almost all $s$, then $\int Y_n(s) I\{|Z_n(s)| \ge \delta\}\,ds \to_p 0$. It
can also be used to simplify the Lindeberg type condition in the form of Rebolledo's martingale
convergence theorem given in Andersen and Gill (1982, appendix).
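
The corollary quoted above can be read off Lemma A3 along the following lines. This is only a sketch of the missing step, and it takes for granted that $Y_n$ satisfies the hypotheses of the lemma with Lebesgue measure on $[0, L]$ in the role of $\lambda$, which the statement of the corollary leaves implicit.

```latex
\begin{gather*}
X_n(s) = Y_n(s)\, I\{|Z_n(s)| \ge \delta\}, \qquad 0 \le X_n(s) \le Y_n(s), \\
P\{X_n(s) > \varepsilon\} \le P\{|Z_n(s)| \ge \delta\} \to 0
  \quad \text{for each } \varepsilon > 0 \text{ and almost all } s, \\
\text{hence } X_n(s) \to_p X(s) = 0, \text{ and Lemma A3 gives }
  \int X_n(s)\,ds \to_p \int 0\,ds = 0 .
\end{gather*}
```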